Apart from an excuse for a pretentious title, I aim here to discuss some of the intricacies in the language used in football analytics. If you are new to the field, it might be worth reading One Short Corner’s fantastic primers.
There’s a brilliant video of Richard Keys and Andy Gray discussing Wenger’s use of the term ‘expected goals’. “I’m not buying into that one,” says Keys, before continuing, “please don’t tell me a stat can tell you when a player is going to score.”
It’s nothing new to say that the name is problematic. But as football analytics continues to gradually permeate the mainstream, it is a problem of increasing importance.
It is one best illustrated by an example, where you are the model. On a scale of 1-5, where 5 is absolute certainty of a goal, rate the following chance:
Now it doesn’t really matter what rating you actually give it, because the point is that you’re digesting the information available to you at this point in the action. You have where the shot is being taken from, like a basic expected goals model, and where the defenders are positioned.
What happens if we add more information?
Some more pre-shot information, mainly that the shot is a product of a Suarez-led counter-attack with defenders out of shape and in recovery mode, makes the chance seem a better one. The likelihood of a goal is probably higher than we thought it was before.
What happens if we add where the shot went?
Neymar’s finish is a good one, drilled at the bottom left corner. Regardless of where you started on the scale of 1-5, you should be closer to 5 now.
But there’s a subtle difference in what we’re talking about as soon as the post-shot information (where the shot went) is included. We’re no longer talking about ‘chance quality’ in the footballing sense, but we are (confusingly) still evaluating the ‘chance’ of a goal.
The colloquial definition of ‘expected goals’ is an objective version of when someone sees a striker miss an opportunity and goes “he should have scored there.” At this point, the quality of the finish is irrelevant; the aim is to categorise chances independently of it.
A literal ‘expected goals’ model, though, meaning one that aims to predict goals with the least error, would be improved by the post-shot information. It would also be improved by information like the current score, who is home and away, what league the match is in, and so on.
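To make the distinction concrete, here is a minimal sketch of the same shot scored by two models that differ only in their inputs. Everything in it is an assumption for illustration: the feature names (`distance_m`, `counter_attack`, `on_target`, `placed_in_corner`), the logistic form, and above all the coefficients, which are invented and not fitted to any data.

```python
import math


def logistic(z: float) -> float:
    """Map a linear score onto a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, invented for illustration only (not fitted to data).
PRE_SHOT_WEIGHTS = {
    "intercept": -1.0,
    "distance_m": -0.12,     # further out, lower probability
    "counter_attack": 0.6,   # defenders out of shape
}

# The post-shot model sees everything the pre-shot model sees,
# plus where the shot actually went.
POST_SHOT_WEIGHTS = {
    **PRE_SHOT_WEIGHTS,
    "on_target": 1.5,
    "placed_in_corner": 1.2,
}


def goal_probability(features: dict, weights: dict) -> float:
    """Score a shot using only the features the given model knows about."""
    z = weights["intercept"]
    for name, value in features.items():
        if name in weights:
            z += weights[name] * value
    return logistic(z)


# One hypothetical shot, described with both pre- and post-shot information.
shot = {
    "distance_m": 10.0,
    "counter_attack": 1.0,
    "on_target": 1.0,
    "placed_in_corner": 1.0,
}

chance_quality = goal_probability(shot, PRE_SHOT_WEIGHTS)   # ignores the finish
expected_goals = goal_probability(shot, POST_SHOT_WEIGHTS)  # includes the finish
print(f"pre-shot: {chance_quality:.2f}, post-shot: {expected_goals:.2f}")
```

The shot is identical in both calls; only the information the model is allowed to use changes, which is exactly why the two numbers answer different questions.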
The real problem is we only have one name.
The pre-shot model is useful for putting opportunities on a common scale and evaluating the sort of situations teams or players are getting and creating, independent of their finishing, which we know varies hugely.
The post-shot model is more useful for simulating matches that have already happened, assessing likely match outcomes, and perhaps for looking into shot quality.
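As a rough sketch of what ‘simulating a match that has already happened’ can mean, the code below turns each team’s list of per-shot goal probabilities into win, draw and loss probabilities. It assumes every shot is an independent event, which is a simplification (rebounds, for instance, break it), and the shot values are hypothetical numbers rather than output from any real model. The function names are my own.

```python
def goal_distribution(shot_probs: list) -> list:
    """Return dist where dist[k] is the probability of scoring exactly k goals.

    Assumes each shot is an independent event with its own goal probability.
    """
    dist = [1.0]  # before any shots: certain to have scored 0 goals
    for p in shot_probs:
        updated = [0.0] * (len(dist) + 1)
        for goals, prob in enumerate(dist):
            updated[goals] += prob * (1.0 - p)  # this shot is missed
            updated[goals + 1] += prob * p      # this shot is scored
        dist = updated
    return dist


def outcome_probabilities(home_shots: list, away_shots: list) -> tuple:
    """Return (home win, draw, away win) probabilities for one match."""
    home = goal_distribution(home_shots)
    away = goal_distribution(away_shots)
    win = draw = loss = 0.0
    for h, p_home in enumerate(home):
        for a, p_away in enumerate(away):
            joint = p_home * p_away
            if h > a:
                win += joint
            elif h == a:
                draw += joint
            else:
                loss += joint
    return win, draw, loss


# Hypothetical per-shot probabilities for each side.
home_shots = [0.75, 0.10, 0.05]
away_shots = [0.30, 0.20]

win, draw, loss = outcome_probabilities(home_shots, away_shots)
print(f"home win {win:.2f}, draw {draw:.2f}, away win {loss:.2f}")
```

This computes the outcome probabilities exactly rather than by random sampling, so the result is the same on every run. Feeding it pre-shot or post-shot values answers different questions, which is the point of keeping the two names apart.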
Going forwards, it is unrealistic to expect unanimity in the models used, since the information available varies depending on the data source, as does methodology. But we can loosely agree on the terminology that we use.
I propose reserving ‘chance quality’ for models that don’t include post-shot information. I would personally argue that these shouldn’t include game state (the score at the time) or league effects either, but that may be a matter of personal taste.
Then, ‘expected goals’ can be a more self-explanatory term (unless you are Andy Gray) for other models: an attempt to predict goals as accurately as possible with all of the information available.
It may seem tedious, but slightly altering the terminology is helpful because it a) ties into football lingo, where ‘chance’ is a noun that means opportunity, and b) helps alleviate constant confusion about the inputs of a model. There may also be more efficient ways to do this, and so I’d be interested to hear thoughts from others.
Maybe we should give it a… chance?