Tag Archives: soccer

World Cup 2014 Data

It’s been a while since the last post I made on this blog, but it’s World Cup season so I had to contribute something.

I’ve collected some player/team data from the FIFA website, which anyone can download and explore for interesting stuff. I’ve only put basic data in there, nothing too technical, but there is a collection of passing and tracking stats and a handful of other categories for every game so far: World_Cup_2014_group_stage (click to download the data in .xlsx format)

If you use this data and see any problems with it let me know. For the USA-Ghana match, a handful of the stats didn’t seem to be published in the usual format so that one is incomplete. I also noticed that the high-intensity distance covered stats for the same game looked strange (probably incorrect) – use with caution.

Here is a small selection of charts based on the published dataset…

 

high-activity-dist-covered

 

(*NB I removed values for USA-Ghana in the above chart)

total-sprints-groupstage

total-passes-groupstage

top-speeds-groupstage

Srna and Di Maria pop up a couple of times with top speeds clocked over 31km/h. Aurier was observed at the fastest speed of 33.52km/h in the Ivory Coast-Colombia game.

top-20-distcovered-groupstage

For total distance covered Bradley makes 3 appearances in the top 20 for his efforts in all 3 games.

Model pitfalls and further discussion of TPOEM

Since my previous post introducing a new model for football analysis, TPOEM, I have developed and integrated some significant improvements to it.

Firstly, the speed with which I can give predictions based on team starting line-ups is much better (less manual input, more automation), so last Saturday I was able to tweet about the model’s predictions well before the 3pm kick-offs began.

Secondly, I have added a manager/leadership factor to the analysis, which is dynamic and unique to each team. This adjustment is intended to ‘smooth’ the team-level aggregate scores that TPOEM calculates in cases where the model would not otherwise capture a persistent difference between a team’s results and their underlying scores. This offsets (albeit not completely) the difference between the model’s league table and the actual league table. Why does that difference arise? Well, the basic underlying reason is the same as why a shots-on-goal league table does not reflect the real league table. I attribute this to a kind of quality factor that I am not picking up in the statistics I use: quality in terms of shooting can relate to the position on the pitch of a shot, whether defenders pressured the attacker and how much the assist contributed to a goal scored. This quality factor will also incorporate a team’s record at home or away. For reference, the model currently seems to think that Stoke and Norwich are outperforming their underlying scores by some margin, whilst Wigan, Southampton and QPR are all doing worse in the league than TPOEM suggests they should be. That might be due to luck, team playing style, management, player leadership, quality or all of the above. The model should now be slightly better at accounting for that.

Predicting part 2

So the first week of predicting using TPOEM brought me a net profit, although my biggest win was West Ham’s away win vs Stoke, and I’ve already explained that the model was distinctly anti-Stoke before the most recent update!

Again, as ever, I am seeking value, so even if TPOEM suggests a probability of only about 40% for an outcome (win/draw/loss), if the bookmakers’ quoted odds imply 35% then I consider it an attractive bet. As it stands I haven’t been that selective about what I bet on: in fact so far I’ve been betting on every match that I ran the model for, even though in many cases the model didn’t really suggest any particular value vs the bookies.
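To make the idea of "value" concrete, here is a minimal sketch of the arithmetic, assuming the bookmaker's price is taken at face value (no adjustment for the bookmaker's margin). The function names are made up for illustration and are not part of TPOEM.

```python
from fractions import Fraction


def implied_probability(fractional_odds: str) -> float:
    """Convert fractional odds such as '2/1' into the implied probability."""
    odds = Fraction(fractional_odds)
    return float(1 / (1 + odds))


def expected_return(model_prob: float, implied_prob: float) -> float:
    """Expected profit per unit staked, if the model's probability is correct."""
    decimal_odds = 1 / implied_prob  # total payout per unit staked on a win
    return model_prob * decimal_odds - 1


# The example from the text: the model says 40%, the quoted odds imply 35%.
edge = expected_return(0.40, 0.35)
print(f"Expected return per unit staked: {edge:+.1%}")
```

A positive figure means the bet has value on the model's numbers, even though the outcome itself is still more likely to lose than win.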

The result this week, from 5 games, was another net profit, this time +26% return (it was +56% last time). But that came from 2 wins, 1 void, 2 lost bets, so in a sense the net result was neutral.  I profited overall because I weighted my bets towards the most attractive in terms of value – the biggest win being a draw-no-bet backing Everton at home to Man City. The model really liked Everton’s chances mostly because Kompany, Aguero and Yaya Touré were all missing for Man City.

I also backed draw-no-bets for Liverpool, Villa and Stoke: lost, won, void respectively. And lastly I went with a draw for Swansea-Arsenal (lost) but in retrospect I shouldn’t have bothered with that bet because the model gave no conclusive direction for the game and the odds weren’t good either.

As I reformat the model’s data and find a better way of communicating its predictions/results, I will publish more information on the blog; I recognise I have kept most of the details pretty close to my chest so far. When I’m at my desk for the 3pm kick-offs I will also tweet about the model’s predictions, so if you’re interested look out for that, but if you bet then you are doing so at your own risk!

Introducing TPOEM

I must say I sometimes get irritated by the overuse of acronyms in today’s world, but this time I’ve created my own. TPOEM rather unimaginatively stands for The Power Of Eleven Model, which I have been developing over the past few weeks.

TPOEM is the culmination of fairly light research into simple OPTA-derived football statistics that I have been analysing over the past 6 months or so. Having only really put the information together over the past week or so, it is a bit foolhardy to discuss TPOEM in any detail right now – but I have already begun using it to objectively rate player/team performance and even test its efficacy at predicting match results.

I will give some detail on how the model works. The first point of note is that it is a bottom-up system. That means that it analyses player data first and team data second. There are several reasons I wanted to approach the analysis in this way:

  • A focus on player statistics gives an objective view of a player’s importance to a team, and can help indicate which players contributed most/least to a team’s performance
  • Player statistics like goals scored and assists are readily available and easily compared between players at different clubs
  • TPOEM can potentially capture information that is useful to understanding team playing styles
  • TPOEM can potentially be used to give a prediction of a match result based on the team starting line-ups, which will give a clearer expectation of a result if key players from either team are missing

Although TPOEM is derived from fairly simple statistics, the most recent iteration incorporates 36 of them, ranging from goals scored and shots on target to tackles and ground duels. I have weighted the utility of each action and applied success rates where available to give a rating in simplified categories:

  • Defending/Ball winning
  • Passing/Ball retention
  • Attacking
  • Discipline
  • Involvement
  • Goalkeeping

Of course the overall scores are adjusted so that the most frequent actions (passing, touches, etc) do not grossly outweigh the less frequent, but arguably more important, actions such as shots on target and goals scored. At the same time, I tried to maintain some care over the relevance of goals as a statistic – of course goals win games, but why should TPOEM rate attackers more highly than defenders because they score more often? Strikers often take all the plaudits for scoring goals but since most goals are scored inside the box I have tried not to unduly credit a goal scored – in many instances it is easier to score a goal than miss. I took a similar view of assists, seeking not to overly ramp-up a player’s score simply because he completed a pass (however important it was). I have to stress that it still wasn’t quite a finger in the air approach to rating – I have reviewed correlations to team performance at various layers with the aim of giving my weightings a scientific basis.
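As a rough illustration of what "weighting the utility of each action and applying success rates" could look like, here is a minimal sketch. The weights, the handful of actions and the scoring formula (weight × volume × success rate) are all invented for this example; they are not TPOEM's actual values or method.

```python
# Hypothetical action -> (category, weight) mapping, for illustration only.
# Frequent actions (passes) get a small weight so they do not swamp rare ones.
ACTION_WEIGHTS = {
    "goal": ("Attacking", 5.0),
    "shot_on_target": ("Attacking", 2.0),
    "pass": ("Passing/Ball retention", 0.05),
    "tackle": ("Defending/Ball winning", 1.0),
}


def category_scores(actions):
    """Score one player's match.

    `actions` maps an action name to (attempts, successes); successes is
    None when no success rate is available for that action.
    """
    scores = {}
    for name, (attempts, successes) in actions.items():
        category, weight = ACTION_WEIGHTS[name]
        if successes is not None and attempts:
            rate = successes / attempts
        else:
            rate = 1.0  # no success rate available: count every attempt
        scores[category] = scores.get(category, 0.0) + weight * attempts * rate
    return scores


# A made-up midfielder: 1 goal, 32 of 40 passes completed, 3 of 4 tackles won.
example = category_scores({"goal": (1, None), "pass": (40, 32), "tackle": (4, 3)})
print(example)
```

Even with 40 passes against a single goal, the small pass weight keeps the passing score below the attacking score, which is the balancing act described above.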

I have now tinkered with the algorithms enough times to realise that although TPOEM in one sense gives an objective rating of player performance, in another sense it remains a reflection of its creator’s biases and research. This is a limitation of any model, which can only be improved by testing and further research.

What about results? Well I will keep publishing information over the coming weeks as I look to find suitable ways of presenting TPOEM’s output.

For now, I have run the model on the first 271 games of the Premier League season (i.e. before the kick-offs on 2 March), and I can announce the players with the most man of the match performances so far this season, according to TPOEM:

Player MoM awards
Santiago Cazorla 13
Gareth Bale 10
Adel Taarabt 8
Eden Hazard 8
Leighton Baines 7
Luis Suárez 7
David Silva 6
Dimitar Berbatov 6
Juan Mata 6
Marouane Fellaini 6

This highlights the importance, according to TPOEM, of Santiago Cazorla to Arsenal’s season in terms of match-winning performances. Both Manchester sides and Arsenal lead the team man of the match awards with 22 apiece, the difference being that there is a much larger spread of players who have put in top performances for United and City in the league.

Predicting

Those readers who follow me on twitter will have noticed that TPOEM liked the value of the chances of a home win for Everton and draws for Swansea vs Newcastle and Manchester United vs Norwich. Please note that this isn’t a direct match result prediction for the above – TPOEM actually had all 3 as odds-on for home wins, but the probability of a draw when compared to quoted bookmakers odds before 3pm seemed attractive at the time.

The main problem I had was in finding an efficient way to input all the line-ups in time for kick-off!

As it was, I completed my efforts and placed bets on all the 3pm kick-offs by 3.25pm – something I will have to work on going forward.

In addition to the above bets, of which only Everton’s home win against Reading paid off, I bet on a draw for Sunderland-Fulham (profit), an away win for West Ham (profit) and a win for QPR. 2 of these bets were actually placed live, with the scores at 0-0, whilst QPR were already 1-0 up at Southampton when I took the gamble of backing them to win. According to TPOEM, Chelsea were massive favourites at home to West Brom so I decided not to bother with a gamble on that game.

Most pleasing was West Ham’s away win at Stoke, a game which I am sure could just as easily have gone either way. When I ran the line-ups through TPOEM, West Ham had actually already made 2 early substitutions, so I incorporated those new players into the line-up. The model indicated about a 30% chance of West Ham winning, which was attractive enough when compared to quoted odds of about 9/4. Fortunately for the early prospects of TPOEM they duly achieved an unlikely result at the Britannia.

I will continue to test TPOEM’s predictive efficacy vs bookmaker odds but for any followers of the blog, please note that I am seeking value not outright wins. Even if Manchester United are heavy favourites to win at home, as they were at the weekend, I may suggest another outcome if the odds are attractive enough depending on what my early-stage model tells me!

Feeding off scraps in the Premier League?

Having looked at the top scoring strikers in the league in a previous post on the race for the golden boot, I now turn my attention to shooting statistics for the leading target men at teams in the bottom half of the table. Players for these teams often ply their trade as a lone striker, with less than average support from midfield. As a result the pressure on them to score every gilt-edged chance is high since every goal is precious for their club to ensure survival.

With only 16 games of the season played, these players all have 6 goals or fewer, so each goal or missed opportunity has a strong bearing on their stats (disclaimer!).

The strikers considered this time round are Djibril Cissé (QPR), Christian Benteke (Aston Villa), Adam Le Fondre (Reading), Arouna Koné (Wigan) and Rickie Lambert (Southampton). Cissé, who has played the least in terms of outfield minutes, has also scored the least with only 2 goals for winless QPR. Rickie Lambert is the most prolific goalscorer so far with 6 goals for Southampton. At the time of writing QPR sit 20th in the league with 7pts and Reading just ahead of them on 9pts, whilst Wigan, Aston Villa and Southampton are all level on 15pts. All stats correct as at 12 December, using EPL Index / Opta data.

Efficiency 11 Dec Goals & Shots per 90 11 Dec

Goals & Shots per 90 Data 11 Dec 2

Of the 5 strikers, Arouna Koné takes the fewest shots with only 2.49 per 90mins on average. This is far fewer than Cissé, Benteke and Le Fondre, who each manage to shoot over 3.5 times per 90mins. But shots alone do not necessarily indicate the quality of opportunities on hand: indeed the current league top scorer Michu has a shots per 90mins rate of 3.13. Cissé’s low shots on target rate of under 30%, of which only a paltry 20% have been goals, has not done much to help QPR’s cause.

Le Fondre and Lambert are easily outperforming the others from this perspective because the quality of their shots appears to be generally much higher, and so although they take fewer shots per game their goalscoring rates are significantly better (c. 0.45 goals per 90mins). Lambert has a particularly good record of making the opposition keeper work when he has a shot: he has hit the target 47.4% of the time.
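For anyone wanting to reproduce this kind of table from raw counts, here is a minimal sketch of the per-90 and accuracy calculations. The input numbers in the example are made up purely to show the arithmetic and do not belong to any of the five strikers.

```python
def per_90(count, minutes):
    """Scale a raw count to a rate per 90 minutes played."""
    return count * 90 / minutes


def shot_profile(minutes, shots, on_target, goals):
    """Basic shooting rates from raw season totals."""
    return {
        "shots_per_90": per_90(shots, minutes),
        "goals_per_90": per_90(goals, minutes),
        "on_target_pct": 100 * on_target / shots,
        "goals_per_shot_on_target_pct": 100 * goals / on_target,
    }


# Hypothetical striker: 1,080 minutes, 30 shots, 12 on target, 5 goals.
profile = shot_profile(minutes=1080, shots=30, on_target=12, goals=5)
for name, value in profile.items():
    print(f"{name}: {value:.2f}")
```

Using minutes rather than appearances matters here, because several of these players have spent time on the bench or come on as substitutes.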

Big Chance Data 11 Dec

Big Chance Economy 11 Dec Big Chances 11 Dec

When it comes to big chances, Koné in particular fares poorly. Although both he and Cissé have a conversion rate of 25%, Koné has had several more gilt-edged chances than Cissé (12 vs 4 respectively). Roberto Martinez will no doubt be disappointed by the return from Koné; on the plus side, however, the sheer frequency of big chances he is involved in may be a positive sign for the team’s prospects. The small sample size for Cissé means that his conversion rate of 25% perhaps does him a disservice at this point in time: if he scores his next one it’ll jump up to 40%.
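The small-sample point can be checked with a few lines of arithmetic. In this sketch the scored counts (1 for Cissé, 3 for Koné) are derived from the 25% rates and chance totals quoted above rather than taken from a separate source.

```python
def conversion_rate(scored, chances):
    """Big chance conversion as a percentage."""
    return 100 * scored / chances


# Current rates: 25% of 4 chances is 1 goal, 25% of 12 chances is 3 goals.
cisse_now = conversion_rate(1, 4)
kone_now = conversion_rate(3, 12)

# What happens to each rate if the next big chance is scored.
cisse_next = conversion_rate(2, 5)
kone_next = conversion_rate(4, 13)

print(f"Cissé: {cisse_now:.1f}% -> {cisse_next:.1f}%")
print(f"Koné:  {kone_now:.1f}% -> {kone_next:.1f}%")
```

One more goal moves Cissé's rate by 15 percentage points but Koné's by under 6, which is why the smaller sample should be read with more caution.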

Benteke, who in recent weeks has kept Darren Bent out of the Aston Villa team, does not perform particularly well in this analysis. Judged purely by the stats in this post he resembles Cissé much more than Lambert, with below average shooting accuracy and below par big chance conversion.

Of the group, unsurprisingly it is Lambert again who does best with big chances with an excellent conversion rate of 75% (3 from 4). When Southampton have needed him most so far he has come up with the goods, but whether that form continues for the rest of the season is another matter.

Premier League 2011-12: Player Impacts – discussion

In previous posts I have tested different ways of rating players using Opta data to mark out key fields for each major position which correlate positively to points.  The summary of these reviews can be read here.

What troubled me about some of the findings in this process was the underperformance of some high-profile players whose strengths were clearly not rewarded by the analysis. Examples include Ashley Cole, Theo Walcott, and even Fabricio Coloccini, who actually made the PFA Team of the Year last season. Although I’m pretty keen to separate subjective opinion from raw data analysis, the presence of Coloccini in the PFA Team of the Year, voted for by fellow professionals, cannot be disregarded lightly. Not to mention his superb performance at the weekend!

So in this series of posts I have published another ‘view’ of footballers – this time looking at team performance in the league with and without a particular player in the starting line-up. This can be used as a simple indicator regarding which players’ presence helps/detracts from their team. I used Tableau Public for the first time for this, and had some teething issues attaching my graphs/tables, so they are shown in separate posts below.

Method

I calculated the average points gained, team goals scored and team goals conceded for every team with each player in the starting line-up, and compared this to the team averages without that player in the starting line-up. Of course, those who started every game don’t have a ‘without’ average, so I removed them. In addition, I took out players who started fewer than 4 games, and players who started more than 34 games. I did this on a whim after I saw that Robin van Persie had a negative impact on Arsenal’s points average: he started 37 games for Arsenal last season, and in the 1 game he didn’t start Arsenal won against Stoke. This annoyingly made Arsenal’s points average without RVP 3pts per game, which is a bit ridiculous when he came off the bench and scored 2 in that game anyway! Players with 1 start had a similar problem, as the result of that single game determined their impact. That example serves a purpose in explaining the limitations of a data table like the one below, even though the bias is reduced by restricting the number of starts to between 4 and 34. Of course, if a player started 34 games but the 4 he missed were away visits to Man City, Man Utd, Arsenal and Spurs, then his points average is again more likely than not to be a little too high.
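The with/without comparison can be sketched in a few lines. This is a simplified illustration for the points average only, assuming each match is stored as the points won plus the set of starters; the data layout, function name and toy season below are invented for the example.

```python
from statistics import mean


def points_impact(matches, player, min_starts=4, max_starts=34):
    """Average points with `player` starting minus average points without.

    `matches` is a list of (points, starters) pairs for one team's season.
    Returns None when the player falls outside the qualifying range of
    starts, or when there is no 'without' sample to compare against.
    """
    with_pts = [pts for pts, starters in matches if player in starters]
    without_pts = [pts for pts, starters in matches if player not in starters]
    if not (min_starts <= len(with_pts) <= max_starts) or not without_pts:
        return None
    return mean(with_pts) - mean(without_pts)


# Toy 38-game season: A starts 37 games, B starts 10, C starts 1.
season = (
    [(3, {"A", "B"})] * 6
    + [(1, {"A", "B"})] * 4
    + [(0, {"A"})] * 14
    + [(1, {"A"})] * 13
    + [(3, {"C"})] * 1
)

for name in ("A", "B", "C"):
    print(name, points_impact(season, name))
```

Player A is the van Persie case: with 37 starts, the single game missed (a win) would give a ‘without’ average of 3pts, so the 34-start cap filters him out instead.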

All the impacts below need to be taken with a pinch of salt but information is power, and I think this review is complementary to my previous player analyses and will help to give a better profile of players and their contribution to team performance. Incidentally, in this review Coloccini didn’t qualify because he started 35 games last season.

Hopefully, the tables/graphs are self-explanatory, but here are some highlights:

  • Adebayor for Spurs had the biggest positive effect on points for any team, followed by Arteta for Arsenal
  • Theo Walcott and Ashley Cole both had a strong positive effect for Arsenal and Chelsea respectively despite the poor stats analysis rating in previous posts
  • Notable ‘unlucky mascots’ for their teams were Berbatov for Man Utd and Ramsey and Arshavin for Arsenal
  • Swansea had a comparatively short range of differences between their players, which shows not only that they were able to field a remarkably consistent team for much of the season, but also perhaps indicates that no matter who was in the starting line-up, the player positions and tactics were relatively easy to substitute

Premier League 2011-12: Player Impacts – average points

The below graph, created using Tableau, shows the difference between points earned last season with that player in the starting line-up, vs points earned without (positive is good!) sorted by team.

Qualifying players were in a team’s starting line-up between 4 and 34 times to create a ‘sensible’ average points difference. For more information on the methodology click here.

An interactive version of the graph is available at the following link:

http://public.tableausoftware.com/views/EPL2011-12GlobalPlayerImpact/AvgPointsdifference?:embed=y

Premier League 2011-12: Player Impacts – goals for & conceded

I used Tableau to create the following graph of the positive/negative difference relating to goals for/against based on team averages with/without that player in the starting line-up that season.

Qualifying players were in a team’s starting line-up between 4 and 34 times to create a ‘sensible’ average difference. For more information on the methodology click here.

Use the version linked to below and hover over data points to see which player each star represents. NB. positive numbers are good for both goals for and goals conceded.

http://public.tableausoftware.com/views/EPL2011-12GlobalPlayerImpact/GoalsForvsCon?:embed=y