In Defense of Game Review Scores

Review scores have become very unfashionable. They've been replaced with badges, recommendation lists, and a whole pile of categorizations. Largely though the problem here isn't with review scores. It's in how people use them.

Michael Heron, Blogger

November 13, 2018


This is a modified version of a post that first appeared on Meeple Like Us.  

You can read more of my writing over at the Meeple Like Us blog, or the Textual Intercourse blog over at Epitaph Online.  You can find some information about my research interests over at my personal homepage.

---

Introduction

It’s hard to overstate the unfashionability of scores in game reviews.  Outlets have been discarding them with merry abandon over the past few years, often remarking that we live in a brave new world in which they no longer serve any useful purpose.  Many sites have never used review scores at all.  They point out, perfectly reasonably, that a score is an inelegant tool that is at best unsophisticated and at worst actively misleading.  Polygon put it this way in their recent announcement:

We believe that a new strategy, focusing on criticism and curation, will better serve our readers than the serviceable but ultimately limited reviews rubric that, for decades, has functioned as a load-bearing pillar of most game publications.

It’s fair to say that to stand in support of a review score is to mark yourself out as a kind of critical cave-dweller, peering into the gloom of the past and ignoring the shiny promise of the future.  People are more sophisticated now.  Games media has gone all artsy and high concept.  As such, readers or viewers should focus on the content of the review and not the final grade at the end.  As a massive media snob myself, this is a view for which I have some sympathy.  In the end though I find it unconvincing.

Review Scores on Meeple Like Us

Meeple Like Us uses a review scale – you’ve probably noticed.  That scale is made up of five stars, and half stars can be awarded.  It’s a nine-point scale and it breaks down into the following broad categories:

 

Rating | Meaning
1 star | This is the kind of game they play in the Bad Place.  No game has been given this to date.
1.5 stars | Awful.  Not fully irredeemable but falls far short of what we should expect in this day and age.
2 stars | It has its merits but ultimately it’s not a game that the reviewer enjoyed.
2.5 stars | Thoroughly mediocre.  You’ll probably forget it the minute you stop playing.  Not good, not bad.
3 stars | Above average.  You will have an okay time playing this, and it can be fun in the right circumstances.
3.5 stars | A good game, worth considering.
4 stars | A great game, well worth considering.
4.5 stars | An excellent game – there’s no such thing as a must-buy game but if there was this would be one.
5 stars | Yeah, ignore what I just said.  This is a genuine must-buy game.  Currently a score only achieved by a single title on Meeple Like Us.

These are completely subjective categories – the difference between a 3.5 and a 4-star game might be as slim as the mood I’m in at the time I write the text.  Usually though there’s a fair degree of consistency.  In the end I believe that ‘good’ and ‘great’ are actually meaningful descriptors even if articulating the difference might be complicated and subject to disagreement.

However, we also accompany every review with a short TL;DR section.  There are standard TL;DR texts attached to each of these review scores, but the plugin we use is configurable and permits us to change them in specific circumstances.  Most of the time we don’t, because the categories are broad enough to be widely encompassing while still specific enough to be meaningful.  However, some of the bespoke TL;DRs we have given are:

Game | Rating | TL;DR
Arkham Horror Card Game | 4 | We think it’s great but the poor value proposition makes it difficult to recommend!
Blank | 3.5 | Look, it’s hard to say. It might be one star, it might be five. It’s almost entirely up to you!
Catan | 3.5 | It’s… it’s complicated.
Clank! | 4 | We think it’s great but the type of group you have is hugely important.
Ganz Schon Clever | 3.5 | More fun than the rating might imply, but that fun is very fragile.
Race for the Galaxy | 4 | Hard to recommend because of the learning curve, but an excellent game.
Tash-Kalar | 4 | Probably the best game I never actually want to play.

Most games don’t get a custom TL;DR because, in the end, the reviews that are significantly at odds with the final score are a small minority.  In fact, out of the 153 games for which we’ve written up reviews, only nine have a custom TL;DR message.  That’s just under 6% of the games we’ve covered where I felt that the review score wasn’t fully compatible with the text of the review.  In none of these cases do I feel like the combination of a review score with a custom TL;DR message falls short of the needs of the reader.

Opinions there can obviously differ, but it seems that if a review score isn’t reflective of the text, the problem is with the text or the rating rather than with the concept of review scores in general.  For the small minority of cases where a review score is not enough, a short sentence can easily provide the rest of the necessary context.
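For the curious, the mechanic is simple enough to sketch in a few lines of code.  This is a minimal illustration in Python – Meeple Like Us actually uses an off-the-shelf WordPress plugin, and the names and structure below are hypothetical rather than that plugin’s real API:

```python
from typing import Optional

# Boilerplate TL;DR text for each point on the nine-point scale
# (texts abridged from the table above).
DEFAULT_TLDRS = {
    1.0: "The kind of game they play in the Bad Place.",
    1.5: "Awful.",
    2.0: "It has its merits, but we didn't enjoy it.",
    2.5: "Thoroughly mediocre.",
    3.0: "Above average.",
    3.5: "A good game, worth considering.",
    4.0: "A great game, well worth considering.",
    4.5: "An excellent game.",
    5.0: "A genuine must-buy game.",
}

def tldr_for(rating: float, override: Optional[str] = None) -> str:
    """Return the bespoke TL;DR if one was written, else the boilerplate."""
    return override if override is not None else DEFAULT_TLDRS[rating]

# Most reviews take the default; a small minority (9 of 153) override it.
print(tldr_for(4.0))                                      # boilerplate
print(tldr_for(3.5, override="It's... it's complicated."))  # bespoke
```

The design point is in the default: the score carries the standard message, and a bespoke line only gets written when the review genuinely deviates from it.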

The Pros and Cons of Review Scores

That might though be a case of putting the cart before the horse – what benefits do we get from incorporating a review score?  Why even bother?  Well…

  1. They impart a lot of information in a very small piece of data. Not everyone will suffer through a two- or three-thousand-word epic review to find out what someone has to say.

  2. They are summative, giving an at-a-glance view of how the pros and cons balance out in the end.

  3. They permit algorithmic comparisons of one review to another, and one game to another.

  4. They can permit prioritisation of consumption. We have 153 reviews written – people can choose to focus only on those games that we think are the best, or worst, to make it easier to deal with a backlog.

What then are the problems that review scores are causing?

  1. They flatten the sophistication of what should be a nuanced piece of review and criticism into a single grade that will be lacking in context. People will often just see the score and never read further.

  2. They are a focus of ire and aggression, nudging conversation away from the substantive points of a review and towards the way it was eventually scored.

  3. Algorithmic comparisons of reviews create the soil within which things like Metacritic can spring up, and comparing ratings is more complicated than just normalising their denominators.

  4. They can misdirect, leading people to miss out on games that are interesting but poorly scored in favour of those that are less ambitious but more consistently executed.

But the thing is – many of these are a consequence of factors outside the control of a review outlet.   You don’t make people read criticism by withholding a final judgement – they’ll read it with or without that.   To believe otherwise is, I think, to mistake eyeballs for attention.    The absence of a score by itself isn’t going to get anyone reading that wasn’t already intending to read anyway.

If the audience gets riled up at review scores, such as when a favourite title is poorly rated, that’s something in them, not something in the review.   To remove review scores as a protection against fanboy aggression is tantamount to negotiating with a really sad and pathetic cadre of wannabe terrorists.

Neither of those stances sways me.

We’re on to more meaningful territory though with the other points.   There is a difference between a 5/5 review and a 10/10 review.  There’s a difference between a 3/5 review and a 6/10 review.   The number of points on a scale is a structural description of nuance, and the choice to award 3 or 4 stars is not the same as choosing between 6 and 8.  In the latter case you’ve also got 7 as an option.   Similarly, there’s a world of difference between a site that grades on a curve and a site that grades according to some fixed measure of value.   Algorithmic comparisons of reviews become possible with review scores, but the comparison is often flawed and the end conclusion similarly so.  That’s true even if you believe it’s possible to average out the subjective appreciation that is inherent in every outlet.

When sites like Metacritic normalise review scores they end up comparing apples against oranges – or at least, Granny Smith apples against Golden Delicious.   Real people can appreciate this nuance implicitly in a way an algorithm can’t, and these comparison sites have a powerfully distorting effect on the marketplace.   It’s not uncommon, for example, for video games to be contracted with the expectation that they receive particular Metacritic scores.  Remuneration for work is likewise often tied to its reception.   As soon as you attach even the illusion of quantification to a piece of evaluative work, people will want to quantify those quantifications in bulk.  The result can be useful, but it can also be intensely misleading.  Value is not an inherent property of measurement, and vice versa.
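To make that concrete, here’s a toy illustration in Python.  The numbers are invented, and this shows only the naive denominator normalisation the argument above describes, not anything Metacritic actually publishes:

```python
def normalise(score: float, scale_max: float) -> float:
    """Naively map a score onto 0-100 by dividing out the denominator."""
    return 100 * score / scale_max

# A 3/5 and a 6/10 both normalise to 60, even though the reviewers chose
# from structurally different sets of options: a five-point scale has no
# equivalent of a 7/10, so "3/5" spans what "6/10" and "7/10" distinguish.
print(normalise(3, 5))   # 60.0
print(normalise(6, 10))  # 60.0

# Granularity differs too: a five-star scale with half stars has 9 values,
# a 0-10 integer scale has 11, and a percentage scale has 101. Averaging
# across them treats coarse and fine judgements as interchangeable - and
# that's before accounting for sites that grade on a curve versus those
# that grade against a fixed measure of value.
```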

Much of the progress of a creative endeavour is built on the idea that it’s the job of some proportion of its outputs to push the envelope, so as to make the environment for the next more interesting.   Review scores tend to disincentivise that by disproportionately rewarding the competent over the creative.   A game that is interesting and flawed will likely receive lower scores than one that is predictable but satisfying.   Quirky often gets lots of praise in the text of a review, but comparatively little of it makes its way into the review score.  In the end quirky is impossible to quantify by its very nature.   That in turn has a dampening effect on people being adventurous with the work they do – pushing the envelope is vital for a creative endeavour, but who says it has to be you that pushes it?  Safe is safe.

As to the claim that review scores skew upwards, that’s probably true, but there are external factors there as well.  There are too many games, and the games that get reviewed in the first place are almost certainly the ones that pass the initial audition phase.   Some games don’t make the cut of warranting the work and attention that goes into a review.   There’s a powerful self-selection bias at play here.  If every game were reviewed, I suspect the overall average would be closer to what people would mathematically expect it to be.
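The effect is easy to demonstrate with a back-of-the-envelope simulation.  The quality distribution and audition threshold below are invented for illustration rather than measured from any real outlet:

```python
import random

random.seed(1)

# Suppose underlying game quality were spread evenly across the 1-5 scale...
all_games = [random.uniform(1.0, 5.0) for _ in range(10_000)]

# ...but only games that pass the initial audition - judged worth the work
# and attention of a full review - actually get reviewed.
reviewed = [quality for quality in all_games if quality > 2.5]

def mean(xs):
    return sum(xs) / len(xs)

print(f"Average quality, all games:      {mean(all_games):.2f}")  # ~3.00
print(f"Average quality, reviewed games: {mean(reviewed):.2f}")   # ~3.75
```

Review only the games that clear the bar, and the average of what you publish will sit comfortably above the average of what exists.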

I understand then the objection outlets have to review scores.  I don’t believe though that the misuse of a number beyond the context in which it was presented is a flaw in the system or in the outlets that use it.   We live in a world where even genuinely objective quantifiable data can be misused to nefarious ends.  Even in board gaming we have sites like Dice Critic attempting to put an algorithm on the ranking of fun.

Alternate Approaches

Most of the outlets that have abandoned quantification of reviews have attempted to square the circle by exploring alternate approaches that address the problems introduced by review scores.  Polygon for example expanded on their motivations for moving to a score-free environment while noting the role that review scores play:

At the same time, we recognize that review scores have served a purpose. We don’t see this decision as a stance against scores. They aren’t inherently bad; in fact, they can be very helpful in their efficiency! 

Their solution is the Polygon Recommends badge.

To help readers who are short on time know what to play, we will now include the Polygon Recommends badge on essays about games that we strongly believe most of our readers should play (or watch). It is not a final verdict on a game, nor does it suggest we have played every moment of what the game has to offer. Instead, the Polygon Recommends badge is a statement that we’ve played enough of the game to feel comfortable putting our support behind it. When a game receives the badge, it will appear at the top of the essay, review or video.

This is probably the most common approach to getting rid of review scores entirely.  Eurogamer for example has four categories – essential, recommended, avoid, and simply not awarded one of the others.   Kotaku briefly experimented with a Yes / No / Not Yet recommendation that accompanied each game.  Rock Paper Shotgun has its system of ‘Bestest Bests’.     Ironically, in abandoning review scores almost every single one of these outlets has resorted to an even broader, flatter replacement: a badge or descriptor that indicates a game is recommended.  That turns a sliding scale of enthusiasm into a binary switch – a stark and abrupt bifurcation into ‘games you should care about’ versus ‘games you shouldn’t’.

To be fair, that judgement is rarely one that is explicitly endorsed by the publications in question.   They point out that games can be full of their own quiet merits without rising to the lofty levels of earning an unqualified recommendation.    They stress that the review matters, not the badge.   They encourage readers to engage with the writing, not with the evaluation.   In the end though prioritisation of attention works exactly the same way with a badge as it does with a review score.  Indeed, it probably exacerbates the problems of directing and misdirecting attention.  The harder an accolade is to obtain, the greater its perceived value.   I think such badges and broad descriptors are a solution that’s worse than its problem.   They’re scores by a different name on a very inaccurate scale.

Some sites have been braver even than this, abandoning all such labels and descriptors in favour of an unadorned and unaccompanied piece of critical content.    Honestly, as much as I bang on about how great it is for reviews to be deep and nuanced and critical… I’m also someone with a dearth of time and I’m unlikely to read a review that doesn’t at least prep me for what I’m going to find within.   I don’t want to read two thousand words of analysis to find out that you felt exactly the same way everyone else did.    When I see a signposted negative review of a universally popular game, or vice versa… well, that gets my interest.

I admire the commitment shown by those outlets that produce content without stark judgement, but I often pass them by in favour of more easily prioritised fare.  Especially since, by the end of this kind of review, I’m often left unsure about the conclusions I should be drawing.   It’s one thing to talk about the flaws and strengths of a game; if I haven’t played it myself I have no idea how I’m supposed to weight these disconnected judgements into a coherent classification.   Even those reviews that come with ‘pros’ and ‘cons’ still leave it up to me to work out how important each is.   Worse are those reviews that never offer up a conclusion or an evaluation, and just leave me to divine one from the entrails of the writing.  Buddy, I’m drowning in content here – at least throw me a life-jacket.

The compromise here is a review that is all critical content followed by a strong, unambiguous conclusion.  In that case though I’m left wondering what the difference is between clear, declarative textual judgements and a number that encapsulates the same information.  It seems in the end that the only real difference is perception.

Conclusion

So, what does my ideal review score system look like?  Spoiler: it looks a lot like the one we use here.  I’ve been struggling with the question of review scores for a long time, and I like what we have now because it’s the product of numerous technical and philosophical refinements.

  1. The scale is precise enough to offer a meaningful range of evaluations, but not so granular that the differences between adjacent scores become meaningless. What’s the difference between a game that scores 95% and one that scores 96%?

  2. The review score is paired with a short, evaluative statement; taken together, the two can signpost especially conflicted reviews. Deviation from the boilerplate has to be limited to a small minority of reviews, otherwise the system is not actually working as intended.

  3. The review score is a part of the review, not the review itself. It is the summary of the text as opposed to a category against which the review is written.  It signposts the direction the review will take but carries with it no critique in and of itself.

  4. The score given has a transparent meaning, but we never pretend it has objectivity.  We’re essentially offering a scale with well-structured degrees of subjective evaluation.

It’s very fashionable to say that review scores are a problem, and I concede that they’re often used in a way that is contrary to their merits.    However, I think they have so much value as a part of a review that to abandon them is to do a disservice to your readers.     They’re important consumer information even if they can’t and shouldn’t be taken as objective.  I think though the best thing they do is force you to nail your colours to the mast – you have made a judgement, it’s up and public, and people can agree or disagree as necessary.    For that to be true though they also have to be used consistently and in a way that is harmonious with the text.   It’s a failing if your review is at odds with your rating, but that failing is internal to what you’ve written.  It’s not an inherent flaw of review systems generally.

I like the system we use here – it has solved many of the problems I have had with review scores since starting the site.   I’m interested to know though what others may think.
