The Meaning of Review Scores

Words: 3628 | Approximate Reading Time: 25-30 minutes

When I was a kid reading video game reviews in magazines, I was somewhat fixated on those magical numbers. The 8/10, the 94/100, and so on. The scores provided always seemed to have a level of finality and objectivity that I could readily understand. Sure, a bunch of words could explain what a game was about and how it controlled and the graphical quality and the like…but all of that paled in comparison to that number.

“Surely,” young me would think, “a 9/10 or a 10/10 is an excellent game that must be worth my time, while something like a 6/10 or 7/10 is just so mediocre that I shouldn’t bother.” Anything below a 6/10 would be nothing more than a game to be laughed at – it had truly failed so spectacularly that all readers could revel in its badness.

As I’ve grown, I’ve become much more skeptical about review scores. To a point that I tend to dislike them. Perhaps it had something to do with being a teacher and grading papers – the absolute mind-numbing monotony of trying to figure out whether a paper should get an 82 or an 83, when those few points mattered little in any grand scheme, was exhausting. It may well also have something to do with the fact that over the years I’ve learned the value of the actual words in the review. What someone talks about, doesn’t talk about, praises, criticizes, and so on is all a window into that person’s mind. It reveals the way they see games. And it is by learning about those tendencies that I – or anyone else – can learn whether I am likely to agree with that person’s assessments, and thus whether that person is someone I should be listening to.

I am of course at the point where I am also likely to just play a game for myself to reach my own conclusion. Not because everyone else must be wrong and the only correct opinion is mine. Instead, it’s that I know I don’t always align with the opinions of others. I have had games recommended to me that I disliked. I have played games that are generally disliked and found them enjoyable. And many times I’ve ended up agreeing with people about games being good or bad. Sometimes I just need to experience a game for myself, whatever the consequences may be.

I have wanted for a while now to do a piece on review scores through the lens of what I’ll call “the philosophy of measurement.” By which I mean examining what those numbers actually mean down at their core, and what they indicate about the reviewers and the players who use them.

Which, to boil it all down, results in the answer that they mean nothing.

But while I provide that conclusion, I should at least carry us through the actual argument.

The catalyst for finally sitting down and writing this all out was a video a few weeks back by Jimquisition host Stephanie Sterling about Tears of the Kingdom. Specifically, the video is about the backlash they received for giving the game the “low” score of a 7/10 – a number which, on their own scale, is code for “a good game.” Sterling’s review brought on an onslaught of people attacking them for a lot of different things, and the video itself focuses on the particular argument that not finishing a game invalidates a review.

I’ve already written at length about my experience with Tears, and I am not really concerned about the review itself. Whatever imaginary number I might use to grade Tears, it doesn’t really undermine Sterling’s criticisms or score. Nor is this about the underlying argument about finishing a game to review it. Others – including Sterling – have addressed the faults in the logic.

What I wanted to tackle and dismantle was that these scores hold any real meaning. Whether that meaning is “internal” – some kind of serious consistency within a person’s own mind – or “external” – a solid indication of the game’s quality that we as outsiders can understand.

Ideally, removing these kinds of ratings systems entirely, with audiences relying solely on the actual words provided by reviewers, would be best. But that ideal does not reflect reality. People like those numbers because of their simplicity. They aren’t going away. This essay won’t change that, even if it were to reach millions.

But it is still useful to think about why these scores exist and what they are supposed to accomplish.

A Brief History of Review Scores

The scores we are familiar with in video game reviews are by no means unique. Reviewing and criticism have existed for centuries, or even longer depending on what you consider “reviewing.”

Early Western philosophy usually liked to talk about “principles” of art in very broad terms, sometimes pointing to examples, but without really diving into those examples in detail. The idea of looking at a particular work and breaking it down into its component parts and analyzing those parts to determine their quality was at best rare, and very likely unheard of. That trend continues for quite a while. Changes in what we might broadly call art criticism occur, with a shift toward a biographical lens – analyzing art through the life of the artist. But this historical trend still doesn’t quite match up to what we envision today.

It’s not until sometime around the 17th or 18th century that the modern concept of “reviewing” rises in Europe. The idea of looking at a specific painting or sculpture and saying “this is what it does well, what it does poorly, and what it evokes as a piece of art” takes hold as the idea of the history of art diverges from the criticism of art.

In other regions of the world you have something we would recognize as “reviewing” taking hold a bit earlier. African and Asian art practices had traditions that were in some cases similar to the broad “principles” in Western philosophy, but also would sometimes connect particular works more cohesively to those principles. So while I’ve focused on the history of art criticism in the West here, it is by no means a universal history.

Of course, we’re not just interested in the history of reviewing, but of review scores. The idea of not just talking about a particular piece of art, but trying to compare that piece to other pieces and especially to capture that relative quality by way of some kind of numerical system.

The idea of applying those scales to what we might broadly call “entertainment” appears to have begun sometime in the early 19th century, when travel writer Mariana Starke used exclamation points to rank travel destinations. Numerical scales of various kinds had been invented well before Starke’s use, but by my own brief research she appears to be the first to use such a scale for an explicitly non-scientific purpose. Eventually this idea would be copied and adopted by others, and the most common system became the use of stars. Short stories, hotels, restaurants, and movies would at different times be rated using these star systems or something similar.

It appears that the idea of video game reviews with a score of some kind originated with the trade magazine Play Meter. Sometime in 1976 the magazine began running a segment called “Critic’s Corner” that reviewed new arcade games. The first iteration did not appear to contain actual scores, but after a bit the author would use pound symbols to indicate overall quality – supposedly replicating the “five star” system already popularized in other forms of criticism such as film.

At some point this would give way to a literal numerical system, generally out of 10 points. An example of this would be Famitsu magazine, first published in 1986, which would publish video game reviews by combining scores from four reviewers (each out of 10) into a total out of 40. Eventually these kinds of publications would expand and move onto the internet, with magazines such as Game Informer, Nintendo Power, and Electronic Gaming Monthly, and websites such as IGN and GameSpot. Reviews from these kinds of sources would either continue to use the 10-point system or else a 100-point system – the latter being just a finer-grained version of the same scale.

And then of course we would get aggregation scores and sites. Metacritic would pull together scores from professional reviewers and everyday consumers to form average scores. Retail stores such as Amazon would allow customers to rate products (out of five stars), giving consumers an opportunity to rate games which would be reflected to other potential customers. And game selling platforms like Steam operate on a straight up/down voting system, aggregating those votes to tell potential customers what proportion of players thought the game was overall good.
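The arithmetic behind these systems is simple enough to sketch out. Below is a minimal illustration of the three styles described above – a Famitsu-style total, a plain critic average, and a Steam-style percent-positive. All of the scores are made-up numbers for illustration, and the Metacritic version is a simplification (the real site applies undisclosed weights to different outlets).

```python
def famitsu_style_total(scores):
    """Four reviewers, each scoring out of 10, summed into a total out of 40."""
    assert len(scores) == 4
    return sum(scores)

def metacritic_style_average(scores):
    """Plain mean of critic scores out of 100 (a simplification --
    the real aggregator uses undisclosed per-outlet weights)."""
    return sum(scores) / len(scores)

def steam_style_percent_positive(up, down):
    """Share of thumbs-up votes among all votes cast."""
    return 100 * up / (up + down)

print(famitsu_style_total([9, 8, 9, 10]))          # 36 (out of 40)
print(metacritic_style_average([85, 90, 70, 95]))  # 85.0
print(steam_style_percent_positive(800, 200))      # 80.0
```

Note how much each system flattens: four distinct opinions become one total, and thousands of individual experiences become a single percentage.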

We will in some cases be revisiting these different systems. But the point of this history is to put these scores into a historical context. Video game review scores did not really arise because we thought there was some “true” score that games deserved. They began from complex historical circumstances relating to selling products, copying practices defined by other mediums, and simplifying content for an audience. Keep that all in mind as we continue.

The Internal (Il)Logic of Review Scores

Alright, so review scores on their own are a product of a complicated set of factors. Not every reviewer today uses scores. In fact, most smaller outfits and individual bloggers ignore scores entirely. Certainly a positive outcome of that decision is that readers are required to engage with the actual content of a review. Agree or disagree with the reviewer, but it is the substance of what they say that matters (or is supposed to matter), and not some number.

But where review scores are still used, it is important to ask what they are supposed to mean, and how that meaning is derived. And it is by pulling those ideas apart that we can see how the logic behind scores falls apart.

So to address this topic, we need to see the underlying problems of constructing a mathematical system for scoring games. Particularly, the ways in which those systems devolve into vagueness, arbitrariness, and internal inconsistency.

Vague Scores

In order for any rating system to mean anything, there must be some clear indicator of what the various points on the scale mean. Say, for example, that you were building a rating system that ranged from 1 to 10 – 1 being the lowest, 10 the highest.

To be able to even start putting games on this scale, you would need some kind of idea of what could possibly constitute a 1/10 game, and a 10/10 game. Why? You could, after all, just guess with a handful of games, and then start building out from there. You give one game a 7/10, and then the next game you don’t like as much so it’s a 6/10, but then the next game is really good so you give it a 9/10, and then you just keep building out from there.

But the “building” system requires constant re-evaluation. Because if you wind up with any 10/10s, and then wind up finding a game that you like even better…what are you supposed to do? If you give it an 11/10, the scale becomes useless. If it’s just a 10/10, then the perfect 10 doesn’t really indicate what it is supposed to convey. So the only real answer – if the scores are still meant to be good information – is to go back and rebuild the whole ranking. Which no one ever does. After all, it’s a lot of work.

If you’re someone who does this kind of ranking professionally, it would also be detrimental to revise scores way down the road. What if someone relied on you saying a game was an 8/10, and then later realized that you had downgraded it to a 6/10 because your system was out of whack? You’ve just outed your scores as unreliable, which makes you as a reviewer unreliable.

So the underlying question is what do the endpoints mean? How often do games actually reach them?

And if you look to most – probably almost all – reviewers, this is a common problem. Many review sites are unlikely to hand out scores outside of the 6-10 range, with very minimal exceptions. And many games will hover around the 7-9 range. But that in turn raises questions about what any of those scores mean. What is an “average” game supposed to be? If it’s 7 (akin to academic grading), then that creates much greater range for “below average” than “above average.” If it’s 5 (the middle of the range), where do we then put other games on the scale – what differentiates a 5 from a 6, a 6 from a 7, and so on?

The vagueness of these scores makes them useless. It’s not that you can’t get any information from numerical scores, but those scores are mere estimates of what you think the number might mean, based on an estimate of what the reviewer thinks the number might mean. It is guesswork all the way down.

The potential exception to this is if a reviewer took the time to carefully indicate what the range meant and to properly utilize that range. To give the audience some clear knowledge about how to read the system. But because that is rare to almost nonexistent, we may as well ignore the idea altogether.

Arbitrariness

Let’s say you’ve got a game that you want to call a solid 7/10. And then you have another game that’s a little bit better. What score do you give it?

If you give it a 7, that indicates you think it’s of the same quality as the initial game. But that’s not accurate.

You could give it an 8, but that indicates it is a good deal better. Again, not really accurate.

Aha! Let’s get into decimals. It’s a 7.5. Although maybe that’s not really accurate either. How about a 7.25? Okay, there we go. We can use gradations to indicate more fine-tuned scores and avoid these problems.

But as soon as we start doing that, we run into the problem of proper differentiation. I already posed the question of what separates a 7 from an 8. But what separates a 7 from a 7.25? 7.25 from 7.5? And so on and so on and so on. The same question confronts us all the way down.

But then we also run into the issue of insufficient differentiation. What if something is between a 7 and a 7.25? Do you then keep getting more and more fine-grained? Can one game be a 7.0258, and another a 7.0259? What does that indicate, then?

So maybe you use tenths of a point, or a 100 point scale. Or a 1000 point scale. But then the problem hits you from the other direction, too.

Because as you try and make scores more and more fine-grained, the actual individual scores lose their meaning. There’s nothing to really differentiate a 7.2 from a 7.3. There’s probably not even much to differentiate a 7.1 from a 7.4 or 7.5. Once you get down to that level, all you have is a general sense of what you kinda sorta think the score should be.

No one has a very clear and pre-defined measure of what these fine-grained scores mean. It would be absurd to do that. But because they’re not clear and pre-defined, they become poor measures. Once again, they’re just guesses.

Indeed, at that level the score of a game may change depending on a whole bunch of circumstances. Whether a game gets a 7.4 or a 7.5 may come down to when you last ate, whether you feel a bit under the weather, whether you got enough sleep, and so on.
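This mood-driven wobble can be made concrete with a toy simulation. Assume (purely for illustration – the noise level here is invented) that day-to-day circumstances add up to ±0.3 points of noise to how a game feels. Then a “true” 7.4 game and a “true” 7.5 game become nearly indistinguishable on any single play-through:

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

def felt_score(true_quality, noise=0.3):
    """One day's rating: the 'true' score plus mood-driven noise,
    rounded to one decimal place like a published review score."""
    return round(true_quality + random.uniform(-noise, noise), 1)

game_a, game_b = 7.4, 7.5  # hypothetical "true" qualities

# Count how often the nominally worse game out-scores the better one.
flips = sum(felt_score(game_a) > felt_score(game_b) for _ in range(10_000))
print(f"Game A out-scored Game B on {flips / 100:.1f}% of simulated days")
```

Under these assumptions the “worse” game wins roughly a third of the time – which is exactly why a one-decimal difference between two scores carries essentially no information.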

It’s arbitrary. A fine-grained score looks professional and like someone put a lot of thought into it, but it means nothing.

Internal Consistency

The issue of arbitrariness also raises the question of how internally consistent these scores are. More importantly, how consistent they can even be.

In talking about all of this, we’ve been just talking about “reviews” in a generic sense. As though you can simply get some kind of sense of where a game should fall by playing it.

But that’s not how it works. Games are a combination of different elements – a narrative and graphics and sound design and environmental design and gameplay mechanics and a core loop and so on and so on.

How should those elements be weighted? If a game has amazing gameplay and a superb narrative, but bad graphics, where does that fall? How about a game with decent graphics, a solid narrative, and really good sound design?

Much as we may try to take these things apart – either in our own minds or as part of a review – that process doesn’t work in practice. If we’re loving a game, we might ignore little glitches in the sound or visuals, or let them pass. If we’re hating a game, those issues stand out all the more. If you enjoy the core gameplay, you might ignore the story and not care about the narrative quality, even if it’s bad. Hating the core gameplay? Then you might hyperfocus on the story and refuse to take it seriously, regardless of its actual quality.

This is all ignoring the idea of external consistency – the idea that different reviewers may place different emphases on different aspects of a game. A 7 from one reviewer may mean something different from a 7 from a different reviewer. We’re setting that aside – reviewers differing from one another isn’t, by itself, a problem.

But because we’re trying to weigh all of these different elements together, a 7 that I give for one game does not actually mean the same thing as a 7 for another game.

At best, it merely indicates my overall experience, but that “overall experience” is the result of me trying to figure out how good I thought each little element was. How I weigh those elements is going to differ from game to game, because how does one even compare these kinds of qualities? Indeed, how do you even rate the individual qualities at all? Can you truly compare cel-shaded graphics to retro pixel animations to the most realistic-looking graphics of the day? Should it matter whether a game runs at 30 or 60 frames per second? Sometimes those questions could matter…and sometimes they’re only relevant if you decide that they are relevant. And sometimes even when you decide they are relevant questions, you may not apply those standards in the same way across different games.
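The weighting problem above can be shown in miniature. Treat an overall score as a weighted average of per-element ratings – a deliberately naive model, with components and weights invented for illustration. The exact same game yields noticeably different “overall” scores depending on which weights the reviewer happens to apply that day:

```python
def overall(components, weights):
    """Weighted average of per-element scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(components[k] * weights[k] for k in components)

# Hypothetical per-element ratings for one game (all invented numbers).
game = {"graphics": 9, "narrative": 5, "gameplay": 8, "sound": 7}

# Two equally defensible weightings of the same elements.
story_first = {"graphics": 0.1, "narrative": 0.5, "gameplay": 0.3, "sound": 0.1}
play_first  = {"graphics": 0.1, "narrative": 0.1, "gameplay": 0.7, "sound": 0.1}

print(round(overall(game, story_first), 1))  # 6.5
print(round(overall(game, play_first), 1))   # 7.7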

Again, all of this boils down to the fact that no matter how hard you try, you can’t come up with a truly internally consistent framework for scoring video games on a numerical scale. You can put games onto that scale, but the scale has no real meaning behind it. Any individual score just means whatever you need it to mean at that moment, regardless of what it meant before.

Concluding Remarks

Our investigation is not done, but I wanted to stop here. I’ve looked specifically at the topic of review scores through the lens of creating those scores. The reviewer’s task of trying to explain how good or bad a game is means dealing with a whole host of issues, and trying to come up with a fully coherent scoring system requires far too much work.

There is plenty more that could be said against review scores. It is easy enough to find articles arguing for review scores to be eliminated entirely, and those articles will often raise additional points to what I’ve raised here. My goal is simply to tie the issue of scoring to underlying principles of the philosophy of measurement. Review scores are in many ways an attempt to make our subjective feelings seem more objective and scientific. Numbers are firm and certain, compared to the potentially uncertain and vague words you use to describe your experience with a game. But the number isn’t actually scientific. It is an illusion. It means nothing at its core.

In writing all of this, I should note that I don’t believe that reviewers who use scores are bad. The reason scores would be used at all is that they are quick bits of information to get across. Your audience may well not want to read multiple paragraphs of text, and might just want a very simple answer to the question of “is this worth my time?” A single number will generally convey the answer.

Though whether that number is actually helpful relies on so many assumptions that we should be hesitant to trust it. In reality, whether a number is helpful requires you to be familiar with a reviewer’s previous work and scores, so that you can put that number into context. If a reviewer gave one game a 9, and then you tried it and hated it, seeing another 9 from them is probably not a helpful indicator.

However, even where that’s true, it still will make plenty of intuitive sense to use numbers. We are often fixated on the idea of “data” and “objectivity” – topics I plan to tackle in the next essay. And that fixation means that people may well expect review scores. Providing these numerical scores can feel like something we are pressured to do for one reason or another.

Attempting to come up with a numerical rating for any game is not a problem. All that matters is making sure we know the extreme limitations of any such system – and that we not take such scores seriously in the slightest.
