
Power analysis for a between-sample experiment

Understanding statistical power is essential if you want to avoid wasting your time in psychology. The power of an experiment is its sensitivity – the likelihood that, if the effect tested for is real, your experiment will be able to detect it.

Statistical power is determined by the type of statistical test you are doing, the number of people you test and the effect size. The effect size is, in turn, determined by the reliability of the thing you are measuring, and how much it is pushed around by whatever you are manipulating.

Since it is a common test, I’ve been doing a power analysis for a two-sample (two-sided) t-test, for small, medium and large effects (as conventionally defined). The results should worry you.


This graph shows you how many people you need in each group for your test to have 80% power (a standard desirable level of power – meaning that if your effect is real you’ve an 80% chance of detecting it).

Things to note:

  • even for a large (0.8) effect you need close to 30 people (total n = 60) to have 80% power
  • for a medium effect (0.5) this is more like 70 people (total n = 140)
  • the required sample size increases dramatically as effect size drops
  • for small effects, the sample required for 80% is around 400 in each group (total n = 800).
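As a rough check on these numbers, here is a minimal Python sketch using the standard normal approximation to the two-sample, two-sided t-test. It is not the R code behind the graph, and the function name `n_per_group` is a made-up label for illustration. Because it approximates the t distribution with a normal one, it comes out a person or so below what an exact t-test power calculation gives.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided, two-sample comparison.

    Normal approximation: n = 2 * ((z_(1 - alpha/2) + z_power) / d) ** 2.
    An exact t-test calculation gives slightly larger values.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Conventional large, medium and small effect sizes
for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    print(f"{label} (d = {d}): about {n_per_group(d)} per group")
```

The pattern matches the list above: halving the effect size roughly quadruples the sample you need, because the effect size is squared in the denominator.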

What this means is that if you don’t have a large effect, studies with between-groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know if the phenomenon you are looking for isn’t real or if you just got unlucky with who you tested.

Implications for anyone planning an experiment:

  • Is your effect very strong? If so, you may rely on a smaller sample (for illustrative purposes, the effect size of the male-female height difference is ~1.7, so large enough to detect with a small sample. But if your effect is this obvious, why do you need an experiment?)
  • You really should prefer within-sample analysis, whenever possible (power analysis of this left as an exercise)
  • You can get away with smaller samples if you make your measure more reliable, or if you make your manipulation more impactful. Both of these will increase your effect size, the first by narrowing the variance within each group, the second by increasing the distance between them
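To make the last point concrete, here is a small sketch of the usual standardised effect size (Cohen's d) for two equal-sized groups. The means and standard deviations are made-up numbers, chosen only to show that halving the within-group spread and doubling the difference between the groups have the same effect on d.

```python
import math


def cohens_d(mean_a, mean_b, sd_a, sd_b):
    """Standardised mean difference, assuming two equal-sized groups."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Illustrative (invented) summary statistics
baseline = cohens_d(105, 100, 10, 10)    # medium effect
less_noise = cohens_d(105, 100, 5, 5)    # same difference, more reliable measure
stronger = cohens_d(110, 100, 10, 10)    # same noise, more impactful manipulation

print(baseline, less_noise, stronger)
```

Either route takes the effect size from 0.5 to 1.0, which is the difference between needing around 60-odd people per group and needing fewer than 30.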

Technical note: I did this by cribbing code from Rob Kabacoff’s helpful page on power analysis. Code for the graph shown here is here. I use and recommend RStudio.

Posted in Research, Teaching | Comments closed

New grant: ‘Neuroimaging as a marker of Attention Deficit Hyperactivity Disorder (ADHD)’

We have been awarded ~£11k by the White Rose Collaboration Fund. This will allow us to carry out a small neuroimaging study investigating brain activity associated with higher levels of ADHD traits. The collaboration combines expertise and facilities across the Universities of Sheffield, Leeds and York. Paul Overton has previously proposed that the subcortical area known as the superior colliculus may be crucial in ADHD. This is the focus of Maria’s PhD thesis (co-supervised by Paul and me). Jaclyn Billington from Leeds has experience imaging the colliculus, and Tony Morland is the deputy director of York’s neuroimaging facility (as well as having a wealth of experience imaging the areas associated with visual function). Alex Wade and Jeff Delvenne provide additional expertise in visual attention. I lead the project.

Here is the blurb:

We will create a unique network of expertise, personnel and facilities from across the WR network in order to establish a novel biomarker of Attention Deficit Hyperactivity Disorder (ADHD).

Despite a high prevalence (up to 10% of children by some estimates), ADHD remains controversial in terms of diagnosis and treatment. Using brain scanning, this network aims to establish a biological marker common to all ADHD sufferers. Such a biomarker could revolutionise our response to ADHD, allowing us to better understand the condition, diagnose earlier, manage the symptoms and target pharmacological interventions. This could potentially alleviate suffering and improve function for millions.

Theoretical direction for this proposal arises from Overton’s recent proposal that a core dysfunction in ADHD is hypersensitivity of the Superior Colliculus (SC), a key subcortical brain region known to play a critical role in attention, spatial orientation and saccadic eye movements. The development of this ‘collicular hypersensitivity’ hypothesis was possible because of the tradition of research into the fundamental neuroscience of subcortical structures at Sheffield.

This hypothesis has been taken forward by Stafford (Sheffield) who, with Panagiotidi, has been developing behavioural tests of collicular sensitivity. Early results show that healthy adults who are high and low on ADHD traits differ in these behavioural measures. However, behavioural tests are limited in that they cannot provide definitive insight into the neural basis of function. Teams in York and Leeds provide expertise in functional brain imaging and the neural basis of attention which would allow the direct translation of the Sheffield research programme into a test of a biomarker for ADHD.

Our primary objective will be to test two groups, high and low in ADHD traits, for collicular responsiveness, using fMRI brain imaging. This testing will use behavioural measures which have been shown to discriminate the two groups, and analytic and imaging expertise from the Leeds and York based applicants in order to determine collicular responsiveness.

Posted in Projects | Comments closed

Event: Crowdsourcing Psychology Data – Online, Mobile and Big Data approaches

Smart phones, social media and networked sensors in everything from trains to toasters – the spread of digital technology creates new opportunities for cognitive scientists. Collecting and analysing the resulting “big data” also poses its own special challenges. This afternoon of talks and discussion is suitable for anyone curious about novel data collection and analysis strategies and how they can be deployed in psychological and behavioural research.

Time: 1pm-5pm, 11th of November 2014

Venue: Department of Psychology, University of Sheffield

We have four speakers followed by a panel discussion. Our speakers:

Martin Thirkettle: “Taking cognitive psychology to the small screen: Making a research focussed mobile app”

Developing a mobile app involves balancing a number of parties – researchers, funders, ethics committees, app developers, not to mention the end users. As the Open University’s “Brainwave” app, our first research-focussed cognitive psychology app, nears launch, I will discuss some of the challenges we’ve faced during the development process.

Caspar Addyman: “Measuring drug use with smartphones: Some misadventures”

Everyday drug use and its effects are not easily captured by lab or survey-based research. I developed the Boozerlyzer, an app that let people log their alcohol intake and their mood, and play simple games that measured their cognitive and emotional responses. Although this had its flaws, it led to an NHS-funded collaboration to develop a simple smartphone tracker for Parkinson’s patients. Which was also problematic...

Robb Rutledge: “Crowdsourcing the cognitive science of decision making and well-being”

Some cognitive science questions can be particularly difficult to address in the lab. I will discuss results from The Great Brain Experiment, an app that allowed us to develop computational models for how decision making changes across the lifespan, and also how rewards and expectations relate to subjective well-being.

Andy Woods: “[C]lick your screen: probing the senses online”

We are at the cusp of some far-reaching technological advances that will be of tremendous benefit to research. Within a few short years we will be able to test thousands of people from any demographic with ‘connected’ technology every bit as good as we use in our labs today — indeed perhaps more so. Here I discuss on-web versus in-lab, predicted technological advances and issues with online research.

Tickets are free and available: here.

Posted in events | Comments closed

New grant: Reduced habitual intrusions : an early marker for Parkinson’s Disease?

I am very pleased to announce that the Michael J Fox Foundation have funded a project I lead titled ‘Reduced habitual intrusions : an early marker for Parkinson’s Disease?’. The project is for 1 year, and is a collaboration between a psychologist (myself), a neuroscientist (Pete Redgrave), a clinician specialising in Parkinson’s (Jose Obeso, in Spain) and a computational linguist (Colin Bannard, in Liverpool). Mariana Leriche will be joining us as a post-doc.

The idea of the project stems from the hypothesis that Parkinson’s Disease will be specifically characterised by a loss of habitual control in the motor system. This was proposed by Pete, Jose and others in 2010. Since my PhD I’ve been interested in automatic processes in behaviour. One phenomenon which seems to offer particular promise for exploring the interaction between habits and deliberate control is the ‘action slip’. This is an error where a habit intrudes into the normal stream of intentional action – for example, when you put the cereal in the fridge, or when someone greets you by asking “Isn’t it a nice day?” and you say “I’m fine thank you”. An interesting prediction of the Redgrave et al theory is that people with Parkinson’s should make fewer action slips (in contrast to all other types of movement errors, which you would expect to increase as the disease progresses).

The domain we’re going to look at this in is typing, which I’ve worked with before, and which – I’ve argued – is a great domain for looking at how skill, intention and habit combine in an everyday task which generates lots of easily coded data.

I feel the project reflects exactly the kind of work I aspire to do – cognitive science which uses precise behavioural measurement, informed by both neuroscientific and computational perspectives, and in the service of an ambitious but valuable goal. Now, of course, we actually have to get on and do it.

Posted in Projects, Research | Comments closed

Teaching: what it means to be critical

We often ask students to ‘critically assess’ research, but we probably don’t explain what we mean by this as well as we could. Being ‘critical’ doesn’t mean merely criticising, just as skepticism isn’t the same as cynicism. A cynic thinks everything is worthless, regardless of the evidence; a skeptic wants to be persuaded of the value of things, but needs to understand the evidence first.

When we ask students to critically assess something we want them to do it as skeptics. You’re allowed to praise, as well as blame, a study, but it is important that you explain why.

As a rule of thumb, I distinguish three levels of criticism. These are the kinds of critical thinking that you might include at the end of a review or a final year project, under a “flaws and limitations” type-heading. Taking the least value first (and the one that will win you the least marks), let’s go through the three types one by one:

General criticisms: These are the sorts of flaws that we’re taught to look out for from the very first moment we start studying psychology. Things like too few participants, lack of ecological validity or the study being carried out on a selective population (such as university psychology students). The problem isn’t that these aren’t flaws of many studies, but rather that they are flaws of too many studies. Because these things are almost always true – we’d always like to have more people in our study! we’re never certain if our results will generalise to other populations – it isn’t very interesting to point this out. Far better if you can make …

Specific criticisms: These are things which are specific weaknesses of the study you are critiquing. Things which you might say as a general criticism become specific criticisms if you can show how they relate to particular weaknesses of a study. So, for example, almost all studies would benefit from more participants (a general criticism), but if you are looking at a study where the experimental and control groups differed on the dependent variable, but the result was non-significant (p=0.09 say), then you can make the specific criticism that the study is under-powered. The numbers tested, and the statistics used, mean that it isn’t possible to resolve whether there probably is or probably isn’t an effect. It’s simply uncertain. So, they need to try again with more people (or less noise in their measures).

Finding specific criticisms means thinking hard about the logic of how the measures taken relate to psychological concepts (operationalisation) and what the comparisons made (control groups) really mean. A good specific criticism will be particular to the details of the study, showing that you’ve thought about the logic of how an experiment relates to the theoretical claims being considered (that’s why you get more credit for making this kind of criticism). Specific criticisms are good, but even better are…

Specific criticisms with crucial tests or suggestions: This means identifying a flaw in the experiment, or a potential alternative explanation, and simultaneously suggesting how the flaw can be remedied or the alternative explanation can be assessed for how likely it is. This is the hardest to do, because it is the most interesting. If you can do this well you can use existing information (the current study, and its results) to enhance our understanding of what is really true, and to guide our research so we can ask more effective questions next time. Exciting stuff!

Let me give an example. A few years ago I ran a course which used a wiki (reader-edited webpages) to help the students organise their study. At the end of the course I thought I’d compare the final exam scores of people who used the wiki against those who hadn’t. Surprise: people who used the wiki got better exam scores. An interesting result, I thought, which could suggest that using the wiki helped people understand the material. Next, I imagined I’d written this up as a study and then imagined the criticisms you could make of it. Obviously the major one is that it is observational rather than experimental (there is no control group), but why is this a problem? It’s a problem because there could be all sorts of differences between students which might mean they both score well on the exam and use the wiki more. One way this could manifest is that diligent students used the wiki more, but they also studied harder, and so got better marks because of that. But this criticism can be tested using the existing data. We can look and see if only highly grading students use the wiki. They don’t – there is a spread of students who score well and who score badly, independently of whether they use the wiki or not. In both groups, the ones who use the wiki more score better. This doesn’t settle the matter (we still need to run a randomised controlled study), but it allows us to finesse our assessment of one criticism (that only good students used the wiki). There are other criticisms (and other checks); you can read about them in the paper we eventually published on the topic.

Overall, you get credit in a critical assessment for showing that you are able to assess the plausibility of the various flaws a study has. You don’t get marks just for identifying as many flaws as possible without balancing them against the merits of the study. All studies have flaws; the interesting thing is to make positive suggestions about what can be confidently learnt from a study, whilst noting the most important flaws, and – if possible – suggesting how they could be dismissed or corrected.

Posted in Teaching | Comments closed

New paper: wiki users get higher exam scores

Just out in Research in Learning Technology is our paper Students’ engagement with a collaborative wiki tool predicts enhanced written exam performance. This is an observational study which tries to answer the question of how students on my undergraduate cognitive psychology course can improve their grades.

One of the great misconceptions about studying is that you just need to learn the material. Courses and exams which encourage regurgitation don’t help. In fact, as well as memorising content, you also need to understand it and reflect that understanding in writing. That is what the exam tests (and what an undergraduate education should test, in my opinion). A few years ago I realised, marking exams, that many students weren’t fulfilling their potential to understand and explain, and were relying too much on simply recalling the lecture and textbook content.

To address this, I got rid of the textbook for my course and introduced a wiki – an editable set of webpages, which the students would use to write their own textbook. An inspiration for this was a quote from Francis Bacon:

Reading maketh a full man,
conference a ready man,
and writing an exact man.

(the reviewers asked that I remove this quote from the paper, so it has to go here!)

Each year I cleared the wiki and encouraged the people who took the course to read, write and edit using the wiki. I also kept a record of who edited the wiki, and their final exam scores.

The paper uses this data to show that people who made more edits to the wiki scored more highly on the exam. The obvious confound is that people who score more highly on exams will also be the ones who edit the wiki more. We tried to account for this statistically by including students’ scores on their other psychology exams in our analysis. This has the effect – we argue – of removing the general effect of students’ propensity to enjoy psychology and study hard, and isolating the additional effect of using the wiki on my particular course.
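One simple way to see what "accounting for" another variable means is a partial correlation: the association between wiki edits and final exam score once the part of each that is predictable from other exam scores has been removed. The sketch below is only an illustration of that general idea, not the analysis reported in the paper, and the three lists are invented numbers with hypothetical names.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_r(xs, ys, zs):
    """Correlation of xs and ys after removing what zs explains in each."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up numbers, purely for illustration
wiki_edits = [0, 2, 5, 1, 8, 3]
other_exams = [55, 60, 58, 70, 65, 62]
final_exam = [52, 61, 63, 68, 72, 64]

print("raw r:    ", round(pearson_r(wiki_edits, final_exam), 2))
print("partial r:", round(partial_r(wiki_edits, final_exam, other_exams), 2))
```

If the partial correlation stays clearly positive, wiki use is associated with exam score over and above general exam ability. It is still a correlation, though, so it cannot rule out confounds that were never measured.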

The result, pleasingly, is that students who used the wiki more scored better on the final exam, even accounting for their general tendency to score well on exams (as measured by grades for other courses). This means that even among people who generally do badly in exams, and did badly on my exam, those who used the wiki more did better. This is evidence that the wiki is beneficial for everyone, not just people who are good at exams and/or highly motivated to study.

Here’s the graph, Figure 1 from our paper:


This is a large effect – the benefit is around 5 percentage points, easily enough to lift you from a mid 2:2 to a 2:1, or a mid 2:1 to a first.

Fans of wiki research should check out this recent paper, Wikipedia Classroom Experiment: bidirectional benefits of students’ engagement in online production communities, which explores potential wider benefits of using wiki editing in the classroom. Our paper is unique for focussing on the bottom line of final course grades, and for trying to address the confound that students who work harder at psychology are likely to both get higher exam scores and use the wiki more.

The true test of the benefit of the wiki would be an experimental intervention where one group of students used a wiki and another did something else. For a discussion of this, and discussion of why we believe editing a wiki is so useful for learning, you’ll have to read the paper.

Thanks go to my collaborators. Harriet reviewed the literature and Herman installed the wiki for me, and did the analysis. Together we discussed the research and wrote the paper.

Full citation:
Stafford, T., Elgueta, H., Cameron, H. (2014). Students’ engagement with a collaborative wiki tool predicts enhanced written exam performance. Research in Learning Technology, 22, 22797. doi:10.3402/rlt.v22.22797

Posted in Research, Teaching | Comments closed

New paper: Performance breakdown effects dissociate from error detection effects in typing

This is the first work on typing that has come out of C’s PhD thesis. C’s idea, which inspired his PhD, was that typing would be an interesting domain to look at errors and error monitoring. Unlike most discrete trial tasks which have been used to look at errors, typing is a continuous performance task (some subjects can type over 100 words per minute, pressing around 10 keys a second!). Furthermore, the response you make to signal an error is highly practiced – you press the backspace. Previous research on error signalling hasn’t been able to distinguish between effects due to the error and effects due to having to make an unpracticed response to signal that you know you made the error.

For me, typing is a fascinating domain which contradicts some notions of how actions are learnt. The dichotomy between automatic and controlled processing doesn’t obviously apply to typing, which is rapid and low effort (like habits), but flexible and goal-orientated (like controlled processes). A great example of how typing can be used to investigate the complexity of action control comes from this recent paper by Gordon Logan and Matthew Crump.

In this paper, we asked skilled touch-typists to copy type some set sentences and analysed the speed of typing before, during and after errors. We found, in contrast to some previous work which had used unpracticed discrete trial tasks to study errors, that there was no change in speed before an error. We did find, however, that typing speeds before errors did increase in variability – something we think signals a loss of control, something akin to slipping “out of the zone” of concentration. A secondary analysis compared errors which participants corrected against those they didn’t correct (and perhaps didn’t even notice they made). This gave us evidence that performance breakdown before an error isn’t just due to the processes that notice and correct errors, but – at least to the extent that error correction is synonymous with error detection – performance breakdown occurs independently of error monitoring.

Here’s the abstract:

Mistakes in skilled performance are often observed to be slower than correct actions. This error slowing has been associated with cognitive control processes involved in performance monitoring and error detection. A limited literature on skilled actions, however, suggests that preerror actions may also be slower than accurate actions. This contrasts with findings from unskilled, discrete trial tasks, where preerror performance is usually faster than accurate performance. We tested 3 predictions about error-related behavioural changes in continuous typing performance. We asked participants to type 100 sentences without visual feedback. We found that (a) performance before errors was no different in speed than that before correct key-presses, (b) error and posterror key-presses were slower than matched correct key-presses, and (c) errors were preceded by greater variability in speed than were matched correct key-presses. Our results suggest that errors are preceded by a behavioural signature, which may indicate breakdown of fluid cognition, and that the effects of error detection on performance (error and posterror slowing) can be dissociated from breakdown effects (preerror increase in variability)

Citation and download: Kalfaoğlu, Ç., & Stafford, T. (2013). Performance breakdown effects dissociate from error detection effects in typing. The Quarterly Journal of Experimental Psychology, 67(3), 508-524. doi:10.1080/17470218.2013.820762

Posted in Research | Comments closed

First visualise, then test

My undergraduate project students are in the final stages of their writing up. We’ve had a lot of meetings over the last few weeks about the correct way to analyse their data. It struck me that there was something I wish I’d emphasised more before they started analysing the data – you should visualise your data first, and only then run your statistical test.

It’s all too easy to approach statistical tests as a kind of magic black box which you apply to the data and – cher-ching! – a result comes out (hopefully p<0.05). We teach our students all about the right kinds of tests, and the technical details of reporting them (F values, p values, degrees of freedom and all that). These last few weeks it has felt to me that our focus on teaching these details can obscure the big picture – you need to understand your data before you can understand the statistical test. And understanding the data means first you want to see the shape of the distributions and the tendency for any difference between groups. This means histograms of the individual scores (how are they distributed? outliers?), scatterplots of variables against each other (any correlation?) and a simple eye-balling of the means for different experimental conditions (how big is the difference? Is it in the direction you expected?).

Without this preparatory stage where you get an appreciation for the form of the data, you risk running an inappropriate test, or running the appropriate test but not knowing what it means (for example, you get a significant difference between the groups, but you haven’t checked first whether it is in the direction predicted or not). These statistical tests are not a magic black box to meaning, they are props for our intuition. You look at the graph and think that Group A scored higher on average than Group B. Now your t-test tells you something about whether your intuition is reliable, or whether you have been fooling yourself through wishful thinking (all too easy to do).
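As a minimal sketch of this "look first" step, the Python below prints a crude text histogram for each group and the direction of the difference in means, before any test is run. The scores and group names are invented for illustration, and `text_histogram` is a hypothetical helper rather than a function from any statistics package.

```python
from collections import Counter
from statistics import mean


def text_histogram(scores, bin_width=5):
    """Return one text line per bin so the shape and any outliers are visible."""
    bins = Counter(int(s // bin_width) * bin_width for s in scores)
    lines = []
    # Walk every bin from lowest to highest, including empty ones
    for start in range(min(bins), max(bins) + bin_width, bin_width):
        end = start + bin_width - 1
        lines.append(f"{start:>4}-{end:<4}| {'#' * bins[start]}")
    return lines


# Made-up scores for two hypothetical groups
group_a = [52, 55, 57, 58, 61, 63, 64, 90]  # note the possible outlier at 90
group_b = [48, 50, 51, 53, 55, 56, 58, 60]

for name, scores in [("Group A", group_a), ("Group B", group_b)]:
    print(name)
    print("\n".join(text_histogram(scores)))

# Check the direction of the difference before testing whether it is reliable
print("Mean difference (A - B):", mean(group_a) - mean(group_b))
```

Even something this crude shows whether a difference in means is being driven by one extreme score, which is exactly the kind of thing a bare p value will not tell you.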

The technical details of running and reporting statistical tests are important, but they are not as important as making an argument about the patterns in the data. Your tests support this argument – they don’t determine it.

Further reading:

Abelson, R. P. (1995). Statistics as principled argument. Psychology Press.

Posted in Teaching | Comments closed

Tracing the Trajectory of Skill Learning With a Very Large Sample of Online Game Players

I am very excited about this work, just published in Psychological Science. Working with an online game developer, I was able to access data from over 850,000 players. This allowed Mike Dewar and me to look at the learning curve in an unprecedented level of detail. The paper is only a few pages long, and there are some great graphs. Using this real-world learning data set we were able to show that some long-established findings from the literature hold in this domain, as well as confirm a new finding from this lab on the value of exploration during learning.

However, rather than the science, in this post I’d like to focus on the methods we used. When I first downloaded the game data I thought I’d be able to use the same approach I was used to using with data sets gathered in the lab – look at the data, maybe in a spreadsheet application like Excel, and then run some analyses using a statistics package, such as SPSS. I was rudely awakened. Firstly, the dataset was so large that my computer couldn’t load it all into memory at one time – meaning that I couldn’t simply ‘look’ at the data in Excel. Secondly, the conventional statistical approaches and programming techniques I was used to either weren’t appropriate or didn’t work. I spent five solid days writing MATLAB code to calculate the practice vs mean performance graph of the data. It took two days to run each time and still didn’t give me the level of detail I wanted from the analysis.

Enter Mike Dewar, dataist, currently employed in the New York Times R&D Lab. Speaking to Mike over Skype, he knocked up a Python script in two minutes which did in 30 seconds what my MATLAB script had taken two days to do. It was obvious I was going to have to learn to code in Python. Mike also persuaded me that the data should be open, so we started a GitHub repository which holds the raw data and all the analysis scripts.
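To give a flavour of why this kind of script can cope with data that won't fit in memory, here is a minimal sketch of the general streaming approach: read one row at a time and keep only a running total and count for each attempt number. It is not Mike's script, and the column names `player`, `attempt` and `score` are assumptions made for illustration rather than the format of the files in the repository.

```python
import csv
import io
from collections import defaultdict


def mean_score_by_attempt(rows):
    """Stream over rows once, keeping a running total and count per attempt."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        attempt = int(row["attempt"])
        totals[attempt] += float(row["score"])
        counts[attempt] += 1
    return {a: totals[a] / counts[a] for a in sorted(counts)}


# Tiny in-memory stand-in for a large file; csv.DictReader reads an open
# file handle in the same lazy, row-by-row way.
sample = io.StringIO("player,attempt,score\na,1,10\nb,1,20\na,2,30\n")
curve = mean_score_by_attempt(csv.DictReader(sample))
print(curve)
```

Memory use here depends on the number of distinct attempt numbers, not on the number of plays, so the same loop works whether the file has three rows or many millions.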

This means that if you want to check any of the results in our paper, or extend them, you can replicate our exact analysis, inspecting the code for errors or interrogating the data for patterns we didn’t spot. There are obvious benefits to the scientific community of this way of working. There are even benefits to us. When one of the reviewers questioned a cut-off value we had used in the analysis, we were able to write back that the exact value didn’t matter, and invited them to check for themselves by downloading our data and code. Even if the reviewer didn’t do this, I’m sure our response carried more weight since they knew they could have easily checked our claim if they had wanted. (Our full response to the first reviews, as well as a pre-print of the paper, is also available via the repository.)

Paper: Stafford, T. & Dewar, M. (2014). Tracing the Trajectory of Skill Learning With a Very Large Sample of Online Game Players. Psychological Science

Data and Analysis code: github.com/tomstafford/axongame

Posted in Research | Comments closed

New project: “Bias and Blame: Do Moral Interactions Modulate the Expression of Implicit Bias?”

The Leverhulme Trust has awarded a 36 month grant to the University of Nottingham, for a project led by my collaborator Dr Jules Holroyd, with support from me. The project title is “Bias and Blame: Do Moral Interactions Modulate the Expression of Implicit Bias?” (abstract below). The aim is to conduct experiments to advance our understanding of how implicit biases are regulated by ‘moral interactions’ (these are things such as being blamed, or being held responsible). The grant will pay for a post-doc (Robin Scaife) in Sheffield and a PhD student (as yet unknown, let us know if you’re interested!) in Nottingham.

Obviously, this is something of a departure for me, at least as far as the topic goes (which is why Jules leads). I’m hoping my background in decision making and training in experimental design will help me navigate the new conceptual waters of implicit bias. Some credit for inspiring the project should go to Jenny Saul and her Bias Project, and before that, Alec Patton and his faith in interdisciplinary dialogue that helped get Jules and me talking about how experiments and philosophical analysis could help each other out.

Project Abstract:

This project will investigate whether moral interactions are a useful tool for regulating implicit bias. Studies have shown that implicit biases – automatic associations which operate without reflective control – can lead to unintentionally differential or unfair treatment of stigmatised individuals. Such biases are widespread, resistant to deliberate moderation, and have a significant role in influencing judgement and action. Strategies for regulating implicit bias have been developed, tested and evaluated by psychologists and philosophers. But neither has explored whether holding individuals responsible for implicit biases may help or hinder their regulation. This is what we propose to do.

Posted in Projects, Research | Comments closed
  • I am a lecturer in Psychology and Cognitive Science at the University of Sheffield. I am my department's Director of Public Engagement.

    Contact: Department of Psychology
    University of Sheffield
    Western Bank
    S10 2TP

    Phone: +44 114 2226620
    Email: t.stafford [at] shef.ac.uk