Originally posted on Medium, 21 September 2016 (read part 2 here).
(Hint: The answer is yes, we can do it much better and faster.)
Building tech products is hard. Building tech products that actually do what they say on the tin is even harder. And proving all of that holds under the harsh light of statistical analysis is something most people try to avoid.
But that’s precisely the challenge we set ourselves when we put Applied to the test earlier this year.
Applied aims to make recruitment smart, fair, and easy. But what does that mean in practice, and are we actually delivering on that promise?
We spent months baking the best of what we knew from the research into our technology — from the right nudges on job descriptions to the best defaults on the number of people reviewing candidate information. But we know that what should work in theory is not guaranteed to work in practice. And as a product incubated in the Behavioural Insights Team — known for its commitment to rigorous testing — we were keen to apply that same empiricism to ourselves.
We set out to answer this question: does using Applied actually improve hiring decisions? (Spoiler alert: it does, and in ways we hadn’t expected.) Below, and over our next set of blog posts, we’ll take you through our experiment and what we discovered.
The A/B test
We took advantage of the fact that we were hiring graduates into the team, which gave us the sample size to meaningfully test Applied against a business-as-usual CV sift. In the former, the 160 best-performing candidates (based on a multiple-choice test) were reviewed using the Applied process: anonymised, chunked work sample responses, each independently reviewed by three different team members. The latter involved an initial yes/no sift by our senior HR manager, after which two members of the team independently rated the ‘yes’ CVs on the kinds of things we usually look for.
Rather than run a randomised controlled trial (RCT), where we would randomly put some candidates through one process and the rest through the other, we decided to be a bit more conservative: all 160 candidates were reviewed by both processes in parallel. This effectively gave them two bites at the cherry: they could get through because their CVs scored well, or because they scored highly on the work sample tests we gave them (some superstars got through both).
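For concreteness, here’s a rough sketch of the kind of per-candidate record this parallel design produces. The field names are made up for illustration, not taken from our actual systems:

```python
# Rough sketch of the per-candidate record the parallel design produces.
# Field names are hypothetical and purely illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    cv_score: float            # rating from the business-as-usual CV review
    applied_score: float       # combined score from three Applied reviewers
    passed_cv_sift: bool       # got through on their CV
    passed_applied_sift: bool  # got through on their work sample answers

    @property
    def invited_to_assessment(self) -> bool:
        # Two bites at the cherry: either route is enough to progress.
        return self.passed_cv_sift or self.passed_applied_sift
```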
We then put successful candidates through rigorous multi-hour assessment centres and final interviews, ultimately offering jobs to our best-performing group. Importantly, we were careful to keep these processes as blind to one another as possible, to minimise the risk that performance in one round affected how reviewers perceived candidates in another.
This allowed us to test which sifting process did a better job of:
- Being smart: that is, finding the ‘best’ candidates, as measured by how strongly sift scores correlated with candidates’ performance in later rounds of the process
- Being fair: that is, unearthing a more diverse set of candidates as measured by socio-demographics (including education) and employment background
- Being easy: in this case, being faster on a simple time-spent calculation.
Results: the headlines
When we pulled all of the data in, lots of things surprised us.
First, we found a positive, statistically significant correlation between how candidates performed in the Applied sift and their scores in the assessment centre and the final interview. In other words, candidates who scored well on Applied also tended to score well in the two in-person rounds.
But there was no discernible correlation between their CV score and their performance in those later rounds. It seemed like having an impressive CV wasn’t a good predictor of being successful in those other tests.
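For readers who want the mechanics, this is the kind of correlation check involved. A minimal sketch in Python, with hypothetical column and file names rather than our actual data:

```python
# Minimal sketch of the correlation check (hypothetical data and names).
import pandas as pd
from scipy.stats import pearsonr

candidates = pd.read_csv("sift_scores.csv")  # hypothetical file

for sift in ["applied_score", "cv_score"]:
    for outcome in ["assessment_centre_score", "final_interview_score"]:
        r, p = pearsonr(candidates[sift], candidates[outcome])
        print(f"{sift} vs {outcome}: r = {r:.2f}, p = {p:.3f}")
```

A small p-value (conventionally below 0.05) is what lets us call the Applied correlation statistically significant.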
The scatter plots below illustrate what we saw in the data.
Applied scores were highly predictive of performance in the assessment centre and final interview, but CV scores weren’t
NB: The regression lines in these graphs depict the relationship between the two variables, based on all the data. Although the lines for CV scores look negative (i.e. suggesting that a candidate with a good CV is less likely to do well in the assessment centre or final interview), they’re not statistically significant, so we interpret them as essentially meaningless.
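For the curious, checking whether a regression line is meaningful boils down to testing whether its slope is significantly different from zero. A sketch of that check, again with hypothetical names:

```python
# Sketch of testing the significance of a regression line's slope
# (hypothetical column and file names).
import pandas as pd
import statsmodels.api as sm

candidates = pd.read_csv("sift_scores.csv")   # hypothetical file

X = sm.add_constant(candidates["cv_score"])   # intercept + CV score
fit = sm.OLS(candidates["assessment_centre_score"], X).fit()

slope = fit.params["cv_score"]
p_value = fit.pvalues["cv_score"]
# A negative slope with a large p-value is consistent with "no relationship",
# which is why we treat the downward-sloping CV lines as meaningless.
print(f"slope = {slope:.2f}, p = {p_value:.3f}")
```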
We would never have hired (or even met!) a whopping 60 per cent of the candidates we offered jobs to if we’d relied on their CVs alone.
We’d sifted them out based on their CVs, but we ended up seeing them because their Applied scores were great. And as it turned out, they were fantastic in person too. This is a pretty huge effect size, and we wouldn’t necessarily expect it to replicate. But even if only 1 in 5 candidates got a job they otherwise wouldn’t have, scaled across the economy that would mean hundreds of thousands of people being hired on merit who otherwise wouldn’t have been.
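The 60 per cent figure itself is simple set arithmetic: of the candidates we offered jobs to, what share had been rejected by the CV sift and only reached us via Applied? A toy illustration (the candidate sets here are invented, not our real data):

```python
# Toy illustration of the 60 per cent calculation; the sets are invented.
offers_made = {"A", "B", "C", "D", "E"}        # candidates offered jobs
passed_cv_sift = {"A", "B"}                    # would have got through on CVs
missed_by_cv = offers_made - passed_cv_sift    # only reached us via Applied

share_missed = len(missed_by_cv) / len(offers_made)
print(f"{share_missed:.0%} of offers would have been missed on CVs alone")
```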
An obvious question is: does any of this translate into on-the-job performance? Our final graduate cohort is too small to test this statistically, but we’re talking to firms interested in doing so.
In part 2, we looked at results on diversity.