
Bias mitigation

On Friday I gave a talk on cognitive and implicit biases, to a group of employment tribunal judges. The judges were a great audience, far younger, more receptive and more diverse than my own prejudices had led me to expect, and I enjoyed the opportunity to think about the area of cognitive bias, and how some conclusions from that literature might be usefully carried over to the related area of implicit bias.

First off, let’s define cognitive bias versus implicit bias. Cognitive bias is a catch-all term for systematic flaws in thinking. The phrase is associated with the ‘Judgement and decision making’ literature which was spearheaded by Daniel Kahneman and colleagues (and for which he received the Nobel Prize in 2002). Implicit bias, for our purposes, refers to a bias in judgements of other people which is unduly influenced by social categories such as sex or ethnicity, and where the person making the biased judgement is either unaware of the undue influence or unable to control it.

So from the cognitive bias literature we get a menagerie of biases such as ‘the overconfidence effect’, ‘confirmation bias’, ‘anchoring’, ‘base rate neglect’, and on and on. From implicit bias we get findings such as that maths exam papers are marked higher when they carry a male name on the top, that job applicants with stereotypically black American names have to send out twice as many CVs, on average, to get an interview, or that people sit further away from someone they believe has a mental health condition such as schizophrenia. Importantly, all these behaviours are observed in individuals who insist that they are not only not sexist/racist/prejudiced but are actively anti-sexism/racism/prejudice.

My argument to the judges boiled down to four key points, which I think build on one another:

1. Implicit biases are cognitive biases

There is slippage in how we identify cognitive biases compared to how we identify implicit biases. Cognitive biases are defined against a standard of rationality – either we know the correct answer (as in the Wason selection task, for example), or we feel able to define irrelevant factors which shouldn’t affect a decision (as in the framing effect found with the ‘Asian Disease problem’). Implicit biases use the second, contrastive, standard. Additionally it is unclear whether the thing being violated is a standard of rationality, or a standard of equity. So, for example, it is unjust to allow the sex of a student to influence their exam score, but is it irrational? (If you think there is a clear answer to this, either way, then you are more confident of the ultimate definition of rationality than a full century of scholars).

Despite these differences, implicit biases can usefully be thought of as a kind of cognitive bias. They are a habit of thought, which produces systematic errors, and which we may be unaware we are deploying (although elsewhere I have argued that the evidence for the unconscious nature of these processes is over-egged). Once you start to think of implicit biases and cognitive biases as very similar, it buys some important insights.


2. Biases are integral to thinking

Cognitive biases exist for a reason. They are not rogue processes which contaminate what would be otherwise intelligent thought. They are the foundation of intelligent thought. To grasp this, you need to appreciate just how hard principled, consistent thought is. In a world of limited time, information, certainty and intellectual energy, cognitive biases arise from necessary short-cuts and assumptions which keep our intellectual show on the road. Time and time again psychologists have looked at specific cognitive biases and found that there is a good reason for people to make that mistake. Sometimes they even find that animals make that mistake, demonstrating that even without the human traits of pride, ideological confusion and general self-consciousness the error persists – suggesting that there are good evolutionary reasons for it to exist.

For an example, take confirmation bias. Although there are risks to preferring to seek information that confirms whatever you already believe, the strategy does provide a way of dealing with complex information, and a starting point (i.e. what you already suspect) which is as good as any other starting point. It doesn’t require that you speculate endlessly about what might be true, and in many situations the world (or other people) is more than likely to put contradictory evidence in front of you without you having to expend effort in seeking it out. Confirmation bias exists because it is an efficient information seeking strategy – certainly more efficient than constantly trying to disprove every aspect of what you believe.

Implicit biases concern social judgement and socially significant behaviours, but they also seem to share a common mechanism. In cognitive terms, implicit biases arise from our tendency towards associative thoughts – we pick up on things which co-occur, and have the tendency to make judgements relying on these associations, even if strict logic does not justify it. The question of how associations are created and strengthened in our minds is beyond the scope of this post.

For now it is clear that making judgements based on circumstantial evidence is unjustified but practical. An uncontentious example might be that you get sick after eating at a particular noodle bar. Maybe it was bad luck, maybe you were going to get sick anyway, or maybe it was the sandwich you ate at lunch, but the odds are good you’ll avoid the noodle bar in the future. Why chance it, when there are plenty of other restaurants? It would be impractical to never make some assumptions, and the assumption-laden (biased!) route offers a practical solution to the riddle of what you should conclude from your food poisoning.

3. There is no bias-free individual

Once you realise that our thinking is built on many fast, assumption-making, processes which may not be perfect – indeed which have systematic tendencies which produce the errors we identify as cognitive bias – you then realise that it would be impossible to have bias-free decision processes. If you want to make good choices today rather than perfect choices in the distant future, you have to compromise and accept decisions which will have some biases in them. You cannot free yourself of bias, in this sense, and you shouldn’t expect to.

This realisation encourages some humility in the face of cognitive bias. We all have biases, and we shouldn’t pretend that we don’t or hope that we can free ourselves of them.

We can be aware of the biases we are exposed to and likely to harbour within ourselves. We can, with a collective effort, change the content of the biases we foster as a culture. We can try hard to identify situations where bias may play a larger role, or identify particular biases which are latent in our culture or thinking. We can direct our bias mitigation efforts at particularly important decisions, or decisions we think are particularly likely to be prone to bias. But bias-free thinking isn’t an option: bias is part of who we are.

4. Many effective mitigation strategies will be supra-personal

If humility in the face of bias is the first practical reaction to the science of cognitive bias, I’d argue that the second is to recognise that bias isn’t something you can solve on your own at a personal psychological level. Obviously you have to start by trying your honest best to be clear-headed and reasonable, but all the evidence suggests that biases will persist, that they cannot be cut out of thinking and may even thrive when we think ourselves most objective.

The solution is to embed yourself in groups, procedures and institutions which help counter-act bias. Obviously, to a large extent, the institutions of law have evolved to counter personal biases. It would be an interesting exercise to review how legal cases are conducted from a psychological perspective, interpreting different features as to how they work with or against our cognitive tendencies (so, for example, the adversarial system doesn’t get rid of confirmation bias, but it does mean that confirmation bias is given equal and opposite opportunity to work in the minds of the two advocates).

Amongst other kinds of ‘ecological control’ we might count proper procedure (following the letter of the law, checklists, etc), control of (admissible) information and the systematic collection of feedback (without which you may not ever come to realise that you are making systematically biased decisions).

Slides from my talk here as Google docs slides and as PDF. Thanks to Robin Scaife for comments on a draft of this post. Cross-posted to the blog of our Leverhulme Trust funded project on “Bias and Blame”.

Posted in Research | Comments closed

Power analysis for a between-sample experiment

Understanding statistical power is essential if you want to avoid wasting your time in psychology. The power of an experiment is its sensitivity – the likelihood that, if the effect tested for is real, your experiment will be able to detect it.

Statistical power is determined by the type of statistical test you are doing, the number of people you test and the effect size. The effect size is, in turn, determined by the reliability of the thing you are measuring, and how much it is pushed around by whatever you are manipulating.

Since it is a common test, I’ve been doing a power analysis for a two-sample (two-sided) t-test, for small, medium and large effects (as conventionally defined). The results should worry you.


This graph shows you how many people you need in each group for your test to have 80% power (a standard desirable level of power – meaning that if your effect is real you’ve an 80% chance of detecting it).

Things to note:

  • even for a large (0.8) effect you need close to 30 people (total n = 60) to have 80% power
  • for a medium effect (0.5) this is more like 70 people (total n = 140)
  • the required sample size increases dramatically as effect size drops
  • for small effects, the sample required for 80% is around 400 in each group (total n = 800).

What this means is that if you don’t have a large effect, studies with between groups analysis and an n of less than 60 aren’t worth running. Even if you are studying a real phenomenon you aren’t using a statistical lens with enough sensitivity to be able to tell. You’ll get to the end and won’t know if the phenomenon you are looking for isn’t real or if you just got unlucky with who you tested.
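For a rough check on these numbers, the textbook normal-approximation formula for a two-group comparison, n per group ≈ 2(z_alpha + z_power)² / d², can be sketched in Python. This is an approximation and not the t-based calculation behind the graph, so it comes out a person or two lower per group. The function name `n_per_group` and the default alpha of 0.05 are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample test of effect size d."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Conventional large, medium and small effect sizes
for label, d in [("large", 0.8), ("medium", 0.5), ("small", 0.2)]:
    print(label, d, n_per_group(d))
```

Because the approximation slightly undershoots the exact t-test requirement, it is safer to round up generously when planning a study.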

Implications for anyone planning an experiment:

  • Is your effect very strong? If so, you may rely on a smaller sample (For illustrative purposes the effect size of the male-female height difference is ~1.7, so large enough to detect with a small sample. But if your effect is this obvious, why do you need an experiment?)
  • You really should prefer within-sample analysis, whenever possible (power analysis of this left as an exercise)
  • You can get away with smaller samples if you make your measure more reliable, or if you make your manipulation more impactful. Both of these will increase your effect size, the first by narrowing the variance within each group, the second by increasing the distance between them
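The last point can be made concrete with a small sketch of the standardised effect size (Cohen's d with a pooled standard deviation). The scores below are invented purely for illustration, and `cohens_d` is a hypothetical helper name, not something from the original analysis.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Made-up scores: the same 2-point difference, measured noisily ...
noisy_a, noisy_b = [8, 10, 12, 14, 16], [6, 8, 10, 12, 14]
# ... and measured more reliably (less spread within each group)
precise_a, precise_b = [10, 11, 12, 13, 14], [8, 9, 10, 11, 12]
# A stronger manipulation: same noise, but the groups sit 4 points apart
strong_a = [x + 2 for x in noisy_a]

print(cohens_d(noisy_a, noisy_b))      # baseline effect size
print(cohens_d(precise_a, precise_b))  # larger: within-group variance is smaller
print(cohens_d(strong_a, noisy_b))     # larger: distance between groups is bigger
```

Either route raises d, and a larger d is what lets you reach the same power with fewer people.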

Technical note: I did this by cribbing code from Rob Kabacoff’s helpful page on power analysis. Code for the graph shown here is here. I use and recommend RStudio.

Posted in Research, Teaching | Comments closed

New grant: ‘Neuroimaging as a marker of Attention Deficit Hyperactivity Disorder (ADHD)’

We have been awarded ~£11k by the White Rose Collaboration Fund. This will allow us to carry out a small neuroimaging study investigating brain activity associated with higher levels of ADHD traits. The collaboration combines expertise and facilities across the Universities of Sheffield, Leeds and York. Paul Overton has previously proposed that the subcortical area known as the superior colliculus may be crucial in ADHD. This is the focus of Maria’s PhD thesis (co-supervised by Paul and me). Jaclyn Billington from Leeds has experience imaging the colliculus, and Tony Morland is the deputy director of York’s neuroimaging facility (as well as having a wealth of experience imaging the areas associated with visual function). Alex Wade and Jeff Delvenne provide additional expertise in visual attention. I lead the project.

Here is the blurb:

We will create a unique network of expertise, personnel and facilities from across the WR network in order to establish a novel biomarker of Attention Deficit Hyperactivity Disorder (ADHD).

Despite a high prevalence (up to 10% of children by some estimates), ADHD remains controversial in terms of diagnosis and treatment. Using brain scanning, this network aims to establish a biological marker common to all ADHD sufferers. Such a biomarker could revolutionise our response to ADHD, allowing us to better understand the condition, diagnose earlier, manage the symptoms and target pharmacological interventions. This could potentially alleviate suffering and improve function for millions.

Theoretical direction for this proposal arises from Overton’s recent proposal that a core dysfunction in ADHD is hypersensitivity of the Superior Colliculus (SC), a key subcortical brain region known to play a critical role in attention, spatial orientation and saccadic eye movements. The development of this ‘collicular hypersensitivity’ hypothesis was possible because of the tradition of research into the fundamental neuroscience of subcortical structures at Sheffield.

This hypothesis has been taken forward by Stafford (Sheffield) who, with Panagiotidi, has been developing behavioural tests of collicular sensitivity. Early results show that healthy adults who are high and low on ADHD traits differ in these behavioural measures. However, behavioural tests are limited in that they cannot provide definitive insight into the neural basis of function. Teams in York and Leeds provide expertise in functional brain imaging and the neural basis of attention which would allow the direct translation of the Sheffield research programme into a test of a biomarker for ADHD.

Our primary objective will be to test two groups, high and low in ADHD traits, for collicular responsiveness, using fMRI brain imaging. This testing will use behavioural measures which have been shown to discriminate the two groups, and analytic and imaging expertise from the Leeds and York based applicants, in order to determine collicular responsiveness.

Posted in Projects | Comments closed

Event: Crowdsourcing Psychology Data – Online, Mobile and Big Data approaches

Smart phones, social media and networked sensors in everything from trains to toasters – the spread of digital technology creates new opportunities for cognitive scientists. Collecting and analysing the resulting “big data” also poses its own special challenges. This afternoon of talks and discussion is suitable for anyone curious about novel data collection and analysis strategies and how they can be deployed in psychological and behavioural research.

Time: 1pm-5pm, 11th of November 2014

Venue: Department of Psychology, University of Sheffield

We have four speakers followed by a panel discussion. Our speakers:

Martin Thirkettle: “Taking cognitive psychology to the small screen: Making a research focussed mobile app”

Developing a mobile app involves balancing a number of parties – researchers, funders, ethics committees, app developers, not to mention the end users. As the Open University’s “Brainwave” app, our first research-focussed cognitive psychology app, nears launch, I will discuss some of the challenges we’ve faced during the development process.

Caspar Addyman: “Measuring drug use with smartphones: Some misadventures”

Everyday drug use and its effects are not easily captured by lab or survey-based research. I developed the Boozerlyzer, an app that let people log their alcohol intake and mood, and play simple games that measured their cognitive and emotional responses. Although this had its flaws it led to an NHS-funded collaboration to develop a simple smartphone tracker for Parkinson’s patients. Which was also problematic...

Robb Rutledge: “Crowdsourcing the cognitive science of decision making and well-being”

Some cognitive science questions can be particularly difficult to address in the lab. I will discuss results from The Great Brain Experiment, an app that allowed us to develop computational models for how decision making changes across the lifespan, and also how rewards and expectations relate to subjective well-being.

Andy Woods: “[C]lick your screen: probing the senses online”

We are at the cusp of some far-reaching technological advances that will be of tremendous benefit to research. Within a few short years we will be able to test thousands of people from any demographic with ‘connected’ technology every bit as good as we use in our labs today — indeed perhaps more so. Here I discuss on-web versus in-lab, predicted technological advances and issues with online research.

Tickets are free and available: here.

Posted in events | Comments closed

New grant: Reduced habitual intrusions : an early marker for Parkinson’s Disease?

I am very pleased to announce that the Michael J Fox Foundation have funded a project I lead titled ‘Reduced habitual intrusions : an early marker for Parkinson’s Disease?’. The project is for 1 year, and is a collaboration between a psychologist (myself), a neuroscientist (Pete Redgrave), a clinician specialising in Parkinson’s (Jose Obeso, in Spain) and a computational linguist (Colin Bannard, in Liverpool). Mariana Leriche will be joining us as a post-doc.

The idea of the project stems from the hypothesis that Parkinson’s Disease will be specifically characterised by a loss of habitual control in the motor system. This was proposed by Pete, Jose and others in 2010. Since my PhD I’ve been interested in automatic processes in behaviour. One phenomenon which seems to offer particular promise for exploring the interaction between habits and deliberate control is the ‘action slip’. This is an error where a habit intrudes into the normal stream of intentional action – for example, when you put the cereal into the fridge, or when someone greets you by asking “Isn’t it a nice day?” and you say “I’m fine thank you”. An interesting prediction of the Redgrave et al theory is that people with Parkinson’s should make fewer action slips (in contrast to all other types of movement errors, which you would expect to increase as the disease progresses).

The domain we’re going to look at this in is typing, which I’ve worked with before, and which – I’ve argued – is a great domain for looking at how skill, intention and habit combine in an everyday task which generates lots of easily coded data.

I feel the project reflects exactly the kind of work I aspire to do – cognitive science which uses precise behavioural measurement, informed by both neuroscientific and computational perspectives, and in the service of an ambitious but valuable goal. Now, of course, we actually have to get on and do it.

Posted in Projects, Research | Comments closed

Teaching: what it means to be critical

We often ask students to ‘critically assess’ research, but we probably don’t explain what we mean by this as well as we could. Being ‘critical’ doesn’t mean merely criticising, just as skepticism isn’t the same as cynicism. A cynic thinks everything is worthless, regardless of the evidence; a skeptic wants to be persuaded of the value of things, but needs to understand the evidence first.

When we ask students to critically assess something we want them to do it as skeptics. You’re allowed to praise, as well as blame, a study, but it is important that you explain why.

As a rule of thumb, I distinguish three levels of criticism. These are the kinds of critical thinking that you might include at the end of a review or a final year project, under a “flaws and limitations” type-heading. Taking the least value first (and the one that will win you the least marks), let’s go through the three types one by one:

General criticisms: These are the sorts of flaws that we’re taught to look out for from the very first moment we start studying psychology. Things like too few participants, lack of ecological validity or the study being carried out on a selective population (such as university psychology students). The problem isn’t that these aren’t flaws of many studies, but rather that they are flaws of too many studies. Because these things are almost always true – we’d always like to have more people in our study! we’re never certain if our results will generalise to other populations – it isn’t very interesting to point this out. Far better if you can make …

Specific criticisms: These are things which are specific weaknesses of the study you are critiquing. Things which you might say as a general criticism become specific criticisms if you can show how they relate to particular weaknesses of a study. So, for example, almost all studies would benefit from more participants (a general criticism), but if you are looking at a study where the experimental and the control group differed on the dependent variable, but the result was non-significant (p=0.09 say), then you can make the specific criticism that the study is under-powered. The numbers tested, and the statistics used, mean that it isn’t possible to resolve whether there probably is or probably isn’t an effect. It’s simply uncertain. So, they need to try again with more people (or less noise in their measures).
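To back up an ‘under-powered’ criticism with a number, you can estimate the power a study had for a plausible effect size. The sketch below uses a normal approximation to the two-sided, two-sample test; the function name `approx_power`, the assumed effect size of 0.5 and the group size of 20 are all hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected test statistic if the effect is real
    # Probability of landing beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# A hypothetical study with 20 people per group looking for a medium effect
print(round(approx_power(0.5, 20), 2))
```

If the estimated power is well below the conventional 80%, a non-significant result tells you very little either way, which is exactly the specific criticism being made.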

Finding specific criticisms means thinking hard about the logic of how the measures taken relate to psychological concepts (operationalisation) and what the comparisons made (control groups) really mean. A good specific criticism will be particular to the details of the study, showing that you’ve thought about the logic of how an experiment relates to the theoretical claims being considered (that’s why you get more credit for making this kind of criticism). Specific criticisms are good, but even better are…

Specific criticisms with crucial tests or suggestions: This means identifying a flaw in the experiment, or a potential alternative explanation, and simultaneously suggesting how the flaw can be remedied or how the likelihood of the alternative explanation can be assessed. This is the hardest to do, and also the most interesting. If you can do this well you can use existing information (the current study, and its results) to enhance our understanding of what is really true, and to guide our research so we can ask more effective questions next time. Exciting stuff!

Let me give an example. A few years ago I ran a course which used a wiki (reader-edited webpages) to help the students organise their study. At the end of the course I thought I’d compare the final exam scores of people who used the wiki against those who hadn’t. Surprise: people who used the wiki got better exam scores. An interesting result, I thought, which could suggest that using the wiki helped people understand the material. Next, I imagined I’d written this up as a study and then imagined the criticisms you could make of it. Obviously the major one is that it is observational rather than experimental (there is no control group), but why is this a problem? It’s a problem because there could be all sorts of differences between students which might mean they both score well on the exam and use the wiki more. One way this could manifest is that diligent students used the wiki more, but they also studied harder, and so got better marks because of that. But this criticism can be tested using the existing data. We can look and see if only high-scoring students use the wiki. They don’t – there is a spread of students who score well and who score badly, independently of whether they use the wiki or not. In both groups, the ones who use the wiki more score better. This doesn’t settle the matter (we still need to run a randomised control study), but it allows us to finesse our assessment of one criticism (that only good students used the wiki). There are other criticisms (and other checks); you can read about them in the paper we eventually published on the topic.

Overall, you get credit in a critical assessment for showing that you are able to assess the plausibility of the various flaws a study has. You don’t get marks just for identifying as many flaws as possible without balancing them against the merits of the study. All studies have flaws, the interesting thing is to make positive suggestions about what can be confidently learnt from a study, whilst noting the most important flaws, and – if possible – suggesting how they could be dismissed or corrected.

Posted in Teaching | Comments closed

New paper: wiki users get higher exam scores

Just out in Research in Learning Technology, is our paper Students’ engagement with a collaborative wiki tool predicts enhanced written exam performance. This is an observational study which tries to answer the question of how students on my undergraduate cognitive psychology course can improve their grades.

One of the great misconceptions about studying is that you just need to learn the material. Courses and exams which encourage regurgitation don’t help. In fact, as well as memorising content, you also need to understand it and reflect that understanding in writing. That is what the exam tests (and what an undergraduate education should test, in my opinion). A few years ago I realised, marking exams, that many students weren’t fulfilling their potential to understand and explain, and were relying too much on simply recalling the lecture and textbook content.

To address this, I got rid of the textbook for my course and introduced a wiki – an editable set of webpages, using which the students would write their own textbook. An inspiration for this was a quote from Francis Bacon:

Reading maketh a full man,
conference a ready man,
and writing an exact man.

(the reviewers asked that I remove this quote from the paper, so it has to go here!)

Each year I cleared the wiki and encouraged the people who took the course to read, write and edit using the wiki. I also kept a record of who edited the wiki, and their final exam scores.

The paper uses this data to show that people who made more edits to the wiki scored more highly on the exam. The obvious confound is that people who score more highly on exams will also be the ones who edit the wiki more. We tried to account for this statistically by including students’ scores on their other psychology exams in our analysis. This has the effect – we argue – of removing the general effect of students’ propensity to enjoy psychology and study hard, and isolating the additional effect of using the wiki on my particular course.

The result, pleasingly, is that students who used the wiki more scored better on the final exam, even accounting for their general tendency to score well on exams (as measured by grades for other courses). This means that even among people who generally do badly in exams, and did badly on my exam, those who used the wiki more did better. This is evidence that the wiki is beneficial for everyone, not just people who are good at exams and/or highly motivated to study.
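The general idea of ‘accounting for’ other exam scores can be illustrated with a partial correlation: remove what the control variable predicts from both wiki use and the course exam score, then correlate what is left. This is a minimal sketch of the logic only, not the analysis reported in the paper; the helper names and every number below are invented for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y, x):
    """What is left of y after a simple least-squares regression on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]


def partial_corr(x, y, control):
    """Correlation between x and y once the control variable is regressed out of both."""
    return pearson(residuals(x, control), residuals(y, control))


# Entirely made-up data for eight hypothetical students
other_exams = [52, 58, 61, 64, 67, 70, 73, 78]  # average mark on other courses
wiki_edits = [3, 0, 12, 5, 20, 2, 15, 9]        # number of wiki edits
course_exam = [55, 57, 66, 64, 74, 68, 77, 79]  # mark on this course's exam

print("raw correlation:    ", pearson(wiki_edits, course_exam))
print("partial correlation:", partial_corr(wiki_edits, course_exam, other_exams))
```

If the partial correlation stays clearly positive, wiki use predicts the course mark over and above general exam ability, though only a randomised study could show that the wiki causes the improvement.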

Here’s the graph, Figure 1 from our paper:


This is a large effect – the benefit is around 5 percentage points, easily enough to lift you from a mid 2:2 to a 2:1, or a mid 2:1 to a first.

Fans of wiki research should check out this recent paper Wikipedia Classroom Experiment: bidirectional benefits of students’ engagement in online production communities, which explores potential wider benefits of using wiki editing in the classroom. Our paper is unique for focussing on the bottom line of final course grades, and for trying to address the confound that students who work harder at psychology are likely to both get higher exam scores and use the wiki more.

The true test of the benefit of the wiki would be an experimental intervention where one group of students used a wiki and another did something else. For a discussion of this, and discussion of why we believe editing a wiki is so useful for learning, you’ll have to read the paper.

Thanks go to my collaborators. Harriet reviewed the literature, and Herman installed the wiki for me and did the analysis. Together we discussed the research and wrote the paper.

Full citation:
Stafford, T., Elgueta, H., Cameron, H. (2014). Students’ engagement with a collaborative wiki tool predicts enhanced written exam performance. Research in Learning Technology, 22, 22797. doi:10.3402/rlt.v22.22797

Posted in Research, Teaching | Comments closed

New paper: Performance breakdown effects dissociate from error detection effects in typing

This is the first work on typing that has come out of C’s PhD thesis. C’s idea, which inspired his PhD, was that typing would be an interesting domain to look at errors and error monitoring. Unlike most discrete trial tasks which have been used to look at errors, typing is a continuous performance task (some of our subjects can type over 100 words per minute, pressing around 10 keys a second!). Furthermore the response you make to signal an error is highly practiced – you press the backspace. Previous research on error signalling hasn’t been able to distinguish between effects due to the error and effects due to having to make an unpracticed response to signal that you know you made the error.

For me, typing is a fascinating domain which contradicts some notions of how actions are learnt. The dichotomy between automatic and controlled processing doesn’t obviously apply to typing, which is rapid and low effort (like habits), but flexible and goal-orientated (like controlled processes). A great example of how typing can be used to investigate the complexity of action control comes from this recent paper by Gordon Logan and Matthew Crump (this).

In this paper, we asked skilled touch-typists to copy type some set sentences and analysed the speed of typing before, during and after errors. We found, in contrast to some previous work which had used unpracticed discrete trial tasks to study errors, that there was no change in speed before an error. We did find, however, that typing speeds before errors did increase in variability – something we think signals a loss of control, something akin to slipping “out of the zone” of concentration. A secondary analysis compared errors which participants corrected against those they didn’t correct (and perhaps didn’t even notice they made). This gave us evidence that performance breakdown before an error isn’t just due to the processes that notice and correct errors, but – at least to the extent that error correction is synonymous with error detection – performance breakdown occurs independently of error monitoring.

Here’s the abstract:

Mistakes in skilled performance are often observed to be slower than correct actions. This error slowing has been associated with cognitive control processes involved in performance monitoring and error detection. A limited literature on skilled actions, however, suggests that preerror actions may also be slower than accurate actions. This contrasts with findings from unskilled, discrete trial tasks, where preerror performance is usually faster than accurate performance. We tested 3 predictions about error-related behavioural changes in continuous typing performance. We asked participants to type 100 sentences without visual feedback. We found that (a) performance before errors was no different in speed than that before correct key-presses, (b) error and posterror key-presses were slower than matched correct key-presses, and (c) errors were preceded by greater variability in speed than were matched correct key-presses. Our results suggest that errors are preceded by a behavioural signature, which may indicate breakdown of fluid cognition, and that the effects of error detection on performance (error and posterror slowing) can be dissociated from breakdown effects (preerror increase in variability)

Citation and download: Kalfaoğlu, Ç., & Stafford, T. (2013). Performance breakdown effects dissociate from error detection effects in typing. The Quarterly Journal of Experimental Psychology, 67(3), 508-524. doi:10.1080/17470218.2013.820762

Posted in Research | Comments closed

First visualise, then test

My undergraduate project students are in the final stages of their writing up. We’ve had a lot of meetings over the last few weeks about the correct way to analyse their data. It struck me that there was something I wish I’d emphasised more before they started analysing the data – you should visualise your data first, and only then run your statistical test.

It’s all too easy to approach statistical tests as a kind of magic black box which you apply to the data and – cher-ching! – a result comes out (hopefully p<0.05). We teach our students all about the right kinds of tests, and the technical details of reporting them (F values, p values, degrees of freedom and all that). These last few weeks it has felt to me that our focus on teaching these details can obscure the big picture – you need to understand your data before you can understand the statistical test. And understanding the data means first you want to see the shape of the distributions and the tendency for any difference between groups. This means histograms of the individual scores (how are they distributed? outliers?), scatterplots of variables against each other (any correlation?) and a simple eye-balling of the means for different experimental conditions (how big is the difference? Is it in the direction you expected?).

Without this preparatory stage where you get an appreciation for the form of the data, you risk running an inappropriate test, or running the appropriate test but not knowing what it means (for example, you get a significant difference between the groups, but you haven’t checked first whether it is in the direction predicted or not). These statistical tests are not a magic black box to meaning, they are props for our intuition. You look at the graph and think that Group A scored higher on average than Group B. Now your t-test tells you something about whether your intuition is reliable, or whether you have been fooling yourself through wishful thinking (all too easy to do).
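
As a rough sketch of that order of operations in Python, using only the standard library: look at the distributions, eyeball the means and the direction of the difference, and only then compute a test statistic. The scores are invented for illustration, the helper names are my own, and Welch's t is just one reasonable choice of test here.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance

# Made-up scores for illustration only; not data from any real study.
group_a = [12, 15, 14, 16, 13, 15, 17, 14, 30]  # note the possible outlier (30)
group_b = [11, 12, 13, 12, 14, 11, 13, 12, 12]

def text_histogram(scores, bin_width=2):
    """Return one text line per bin so the shape of the distribution is visible."""
    bins = Counter((s // bin_width) * bin_width for s in scores)
    lines = []
    for start in range(min(bins), max(bins) + bin_width, bin_width):
        lines.append(f"{start:>3}-{start + bin_width - 1:<3} | {'#' * bins[start]}")
    return lines

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Step 1: look at the distributions (shape, outliers).
for name, scores in (("A", group_a), ("B", group_b)):
    print(f"Group {name}")
    print("\n".join(text_histogram(scores)))

# Step 2: eyeball the means and the direction of the difference.
diff = mean(group_a) - mean(group_b)
print(f"mean A = {mean(group_a):.2f}, mean B = {mean(group_b):.2f}, A - B = {diff:.2f}")

# Step 3: only now compute the test statistic.
t, df = welch_t(group_a, group_b)
print(f"Welch t = {t:.2f}, df = {df:.1f}")
```

The histogram for Group A makes the stray score of 30 obvious before any test is run, which is exactly the kind of thing you want to know when deciding how much to trust the t value.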

The technical details of running and reporting statistical tests are important, but they are not as important as making an argument about the patterns in the data. Your tests support this argument – they don’t determine it.

Further reading:

Abelson, R. P. (1995). Statistics as principled argument. Psychology Press.
Posted in Teaching | Comments closed

Tracing the Trajectory of Skill Learning With a Very Large Sample of Online Game Players

I am very excited about this work, just published in Psychological Science. Working with an online game developer, I was able to access data from over 850,000 players. This allowed Mike Dewar and me to look at the learning curve in an unprecedented level of detail. The paper is only a few pages long, and there are some great graphs. Using this real-world learning data set we were able to show that some long-established findings from the literature hold in this domain, as well as confirm a new finding from this lab on the value of exploration during learning.

However, rather than the science, in this post I’d like to focus on the methods we used. When I first downloaded the game data I thought I’d be able to use the same approach I was used to with data sets gathered in the lab – look at the data, maybe in a spreadsheet application like Excel, and then run some analyses using a statistics package, such as SPSS. I was rudely awakened. Firstly, the dataset was so large that my computer couldn’t load it all into memory at one time – meaning that I couldn’t simply ‘look’ at the data in Excel. Secondly, the conventional statistical approaches and programming techniques I was used to either weren’t appropriate or didn’t work. I spent five solid days writing MATLAB code to calculate the practice vs mean performance graph of the data. It took two days to run each time and still didn’t give me the level of detail I wanted from the analysis.

Enter Mike Dewar, dataist, currently employed at the New York Times R&D Lab. When we spoke over Skype, he knocked up a Python script in two minutes which did in 30 seconds what my MATLAB script had taken two days to do. It was obvious I was going to have to learn to code in Python. Mike also persuaded me that the data should be open, so we started a GitHub repository which holds the raw data and all the analysis scripts.
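
The actual scripts are in the repository. Purely as an illustration of why a practice vs mean performance curve doesn't need the whole dataset in memory, here is a minimal Python sketch that streams through the rows one at a time and keeps only a running sum and count per attempt number. The column names (player, attempt, score) and the sample rows are hypothetical, and this is not Mike's script.

```python
import csv
import io
from collections import defaultdict

def mean_score_by_attempt(rows):
    """Stream over rows, keeping only a running sum and count per attempt number."""
    totals = defaultdict(lambda: [0.0, 0])  # attempt -> [sum of scores, number of plays]
    for row in rows:
        entry = totals[int(row["attempt"])]
        entry[0] += float(row["score"])
        entry[1] += 1
    return {attempt: s / n for attempt, (s, n) in sorted(totals.items())}

# Tiny made-up sample standing in for a file far too large to open at once.
# The column names are assumptions, not the real format of the game data.
sample = io.StringIO(
    "player,attempt,score\n"
    "p1,1,100\n"
    "p1,2,140\n"
    "p2,1,80\n"
    "p2,2,120\n"
    "p2,3,150\n"
)

curve = mean_score_by_attempt(csv.DictReader(sample))
for attempt, avg in curve.items():
    print(attempt, avg)
```

With a real file you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample; memory use then depends on the number of distinct attempt numbers, not on the number of plays.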

This means that if you want to check any of the results in our paper, or extend them, you can replicate our exact analysis, inspecting the code for errors or interrogating the data for patterns we didn’t spot. There are obvious benefits to the scientific community of this way of working. There are even benefits to us. When one of the reviewers questioned a cut-off value we had used in the analysis, we were able to write back that the exact value didn’t matter, and invited them to check for themselves by downloading our data and code. Even if the reviewer didn’t do this, I’m sure our response carried more weight since they knew they could have easily checked our claim if they had wanted. (Our full response to the first reviews, as well as a pre-print of the paper, is also available via the repository.)

Paper: Stafford, T. & Dewar, M. (2014). Tracing the Trajectory of Skill Learning With a Very Large Sample of Online Game Players. Psychological Science

Data and Analysis code: github.com/tomstafford/axongame

Posted in Research | Comments closed
  • I am a lecturer in Psychology and Cognitive Science at the University of Sheffield. I am my department's Director of Public Engagement.

    Contact: Department of Psychology
    University of Sheffield
    Western Bank
    S10 2TP

    Phone: +44 114 2226620
    Email: t.stafford [at] shef.ac.uk