A Data-Dive into the Video Game Industry
- Shail Mirpuri
- Aug 24, 2020
- 11 min read
Updated: Sep 9, 2020

Source: Tech Crunch
Overview
In the last decade, video games have reached new heights with the development of technologies that have led to limitless possibilities for creators. In fact, in 2018 the industry hit a new record with total sales exceeding $43.4 billion. With the recent boom in the total production, it is indeed important for video game producers to understand the factors that determine an individual game's total sales. With this invaluable insight, producers will be able to alter their production to maximise total revenue.
I have attempted to identify these factors by analysing over 16,000 rows of video game sales data. We will start by analysing the distribution of video game sales in our dataset. After this, we will dive further into the correlation of global sales with other factors in our dataset. Following this, we will compare and contrast different subsets of video games, grouped by genre, series, and platform. Finally, I will summarize the extensive process of model building and tuning that I underwent to generate an algorithm that can predict future video game sales. If you'd like a more in-depth look at this algorithm and the rest of my analysis: Click Here!
Distribution of Video Game Sales
A brief exploration into the distribution of global sales for video games may be useful in determining the likelihood of a 'hit' product in the industry.

From the figure above, we can easily tell that the distribution of global sales in the video game industry is right-skewed. This is confirmed by the skewness, which has a value of 17.4. Delving further, we find that the kurtosis is 603.93. Kurtosis measures how heavy-tailed the data is, and such a high value suggests the presence of extreme outliers in global sales. Together, these statistics indicate that the distribution of global sales in our data is heavily right-skewed. This could imply that while the majority of the games produced are relatively unsuccessful, a few 'hit' games break out significantly and make up a large chunk of the industry's total sales. The significant outliers in the data will be addressed prior to model building using a statistical technique known as Winsorization, which caps extreme values at a chosen percentile rather than removing them. This will reduce the overall effect of the outliers on the training of our model in order to prevent overfitting.
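As a rough illustration of these statistics, the sketch below computes skewness and kurtosis on synthetic right-skewed data and then winsorizes the upper tail. The data, the 1% cutoff, and the variable names are assumptions made for the example; the article does not state which cutoff was actually used.

```python
import numpy as np
from scipy.stats import kurtosis, skew
from scipy.stats.mstats import winsorize

# Synthetic, heavily right-skewed "sales" values (not the article's data).
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=-1.0, sigma=1.5, size=5000)

print("skewness:", skew(sales))
print("excess kurtosis:", kurtosis(sales))  # Fisher definition: normal = 0

# Cap the top 1% of values at the largest value below that tail.
# The 1% limit is an assumed cutoff chosen for illustration only.
capped = np.asarray(winsorize(sales, limits=(0.0, 0.01)))

print("max before:", sales.max(), "max after:", capped.max())
print("skewness after:", skew(capped))
```

Winsorizing keeps every row in the dataset, so the 'hit' games still count as high sellers; they simply stop dominating the error the model is trained to minimise.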
What variables in our dataset are correlated?

From the correlation matrix above we can see that the sales of certain regions have increased over time, while others have decreased. The most prominent case of declining sales can be seen in Japan. It seems that the buzz of the video game industry has slightly died down in Japan over time and has potentially shifted to other markets. Furthermore, a slight negative correlation between year and global sales implies that although global sales have decreased over time, there are other, more significant factors impacting them. We can also see a high positive correlation between global sales and the sales of each region. This is expected since global sales are the sum of the regional sales. It is, therefore, important to exclude these features in our model building as they act as a source of data leakage. We will now dive deeper into how global sales have evolved over the years.
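The leakage argument can be seen in a small sketch. The column names below (`NA_Sales`, `Global_Sales`, and so on) and the synthetic values are assumptions for illustration, not the article's dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sales table; column names are assumed.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "Year": rng.integers(1980, 2017, size=n),
    "NA_Sales": rng.lognormal(-1.0, 1.0, size=n),
    "EU_Sales": 0.6 * rng.lognormal(-1.0, 1.0, size=n),
    "JP_Sales": 0.3 * rng.lognormal(-1.0, 1.0, size=n),
    "Other_Sales": 0.1 * rng.lognormal(-1.0, 1.0, size=n),
})
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Global sales are built as the sum of the regional columns.
df["Global_Sales"] = df[region_cols].sum(axis=1)

# Pearson correlation of every column with the target.
corr = df.corr()
print(corr["Global_Sales"].round(2))

# Raw regional sales add up to the target, so they are dropped as features.
features = df.drop(columns=region_cols + ["Global_Sales"])
print(list(features.columns))
```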
Video Game Sales Over Time

The scatter plot above illustrates some key insights about the transformation of the video game industry through the years. Back in the 1980s, when the industry first emerged, very few games were produced. As time progressed, the number of products on the market slowly began to rise, which may be due to the rapid development of technology during this time. It can also be seen that there were significantly more 'hit' games (those that break out from the rest of the pack) at the turn of the 21st century. Although the correlation coefficient implies a slight negative correlation, my interpretation is that there was a steady rise in global sales over time until they hit their peak at around 2007/2008. A potential explanation for the fall in global sales after 2007 is the invention of other entertainment technologies that are not included in this dataset, such as smartphones. A large number of gamers, possibly casual ones, seem to have moved off traditional video games and over to other platforms. It is also important to note that our dataset only has a small sample of sales data for games after 2015.
Best Selling Genres

The table above shows the top 7 best selling genres by median global sales. We used median as our method of aggregation since, as established earlier, global sales is significantly right skewed. This means that the median would provide a better measure of centre since it is more robust to outliers than other measures like the mean. From this table, we can see that Platform games tend to be the best selling ones in the industry. These are games that involve players controlling their character and avoiding obstacles. Some notable 'hit' games in this category include Super Mario Bros and The Legend of Zelda.
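To see why the choice of aggregation matters, here is a minimal sketch on a toy table (made-up numbers, assumed column names) where a single outlier per genre changes the ranking depending on whether the mean or the median is used.

```python
import pandas as pd

# Toy data: two genres each contain one extreme 'hit'.
games = pd.DataFrame({
    "Genre": ["Platform", "Platform", "Platform",
              "Sports", "Sports", "Sports",
              "Puzzle", "Puzzle"],
    "Global_Sales": [0.3, 0.5, 40.0, 0.2, 0.25, 80.0, 0.1, 0.15],
})

# Aggregate per genre and rank by the median, which is robust to outliers.
by_genre = (
    games.groupby("Genre")["Global_Sales"]
    .agg(["median", "mean"])
    .sort_values("median", ascending=False)
)
print(by_genre)

top_by_median = by_genre.index[0]
top_by_mean = by_genre["mean"].idxmax()
print("top by median:", top_by_median, "| top by mean:", top_by_mean)
```

In this toy table the mean is dominated by the single largest title, while the median reflects what a typical game in the genre sells.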
Comparing Sales of Popular Series

From the table above, we can see that far and away the most successful series in terms of median global sales is the Halo series. It is important to point out, however, that we have only compared the top 8 most popular game series. Furthermore, if we look at the correlation between global sales and time, we can see that series such as FIFA, Call of Duty, Halo and Grand Theft Auto have become more successful as time has gone by. This may reflect the shift to next-generation consoles such as the Xbox One and PS4, on which these games have thrived through improved graphics quality and game functionality. In contrast, series such as Mario and Pokemon seem to have decreased in total sales over time. This may be expected since these games tend to be considered 'classics', meaning that as technology has developed, the opportunity to enhance a gamer's experience relative to previous versions may be limited. Another factor that may have influenced the shift in popularity from Mario and Pokemon to series such as Call of Duty and FIFA is the demand for multiplayer online gaming and the rising hype around E-Sports, driven by an increase in tournaments and lucrative prizes.
At the bottom of our comparison lies the Wii series. This is peculiar since the maximum global sales for a single game in our dataset actually belongs to a Wii series title: Wii Sports. This may reflect the rapid rise and subsequent sharp drop of the Wii hype. When the Wii first came out, it was widely touted as a game-changer in the video game industry thanks to its new technology that tracked controller movement. However, as time went on, its market share plummeted as gamers switched over to the next generation of consoles: Xbox and PlayStation. We can see this reflected in its relatively unsuccessful game series, whose sales have declined over time.

What devices are gamers playing on?

Due to the release of many different devices over the years, it is hard to compare them fairly against each other directly. Instead, we will group these devices into three categories: console, portable and PC. We can see that over time console games seem to sell significantly better than portable and PC games. It is also interesting to observe that the sales of PC and portable games have declined slightly over time, while console games have held relatively constant. This suggests that it may make more sense for a video game creator to focus their production on console games, as these seem to have higher sales while also being robust to longer-term trends.
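One way to build such a grouping is a simple lookup from platform code to device category. The platform codes and the set of handhelds below are assumptions for illustration; the article does not list the exact mapping it used.

```python
import pandas as pd

# Assumed handheld platform codes; everything else except "PC" is
# treated as a console in this sketch.
PORTABLE = {"GB", "GBA", "DS", "3DS", "PSP", "PSV"}

def device_type(platform: str) -> str:
    """Map a platform code to one of: Console, Portable, PC."""
    if platform == "PC":
        return "PC"
    if platform in PORTABLE:
        return "Portable"
    return "Console"

# Toy rows with made-up sales figures.
games = pd.DataFrame({
    "Platform": ["Wii", "DS", "PC", "PS4", "PSP"],
    "Global_Sales": [1.2, 0.4, 0.1, 0.9, 0.3],
})
games["Device"] = games["Platform"].map(device_type)

print(games)
print(games.groupby("Device")["Global_Sales"].median())
```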
Feature Selection and Engineering
We will now move on to the process of building a model that can predict the global sales of any video game. We need to engineer some features from our dataset in order to make them more informative. I have engineered 3 key features of each video game:
1. Device Type:
As we have seen from our exploratory data analysis, it seems that the type of device a video game is released on has an impact on its global sales. Rather than including all the different individual devices, it makes more sense to group them into categories, as this will allow our model to perform significantly better. Furthermore, we can analyse this feature after our model is built in order to see whether it plays a significant role in the sales of a video game. This will indicate to producers whether it makes more sense to prioritise games on certain device types.
2. Series:
In order to try and account for the actual content of a game, we will consider whether or not it is in a series. Since series games tend to be improved versions or continuations of each other, this can have a huge impact on the global sales of a game by drawing in returning consumers. Again, we will consider the top 8 most popular game series when engineering this feature. If a game is not part of these series then it will receive a value of false in each column. This feature can provide insight on whether or not it makes sense for producers to invest time and money into creating and maintaining a series rather than releasing all features into a single game.
3. Distribution of Global Sales:
We will also ignore the gross regional sales figures and instead look at each as a percentage of total global sales. This is because the gross regional sales are highly correlated with global sales, since they add up to it. If they were left in our model, they would leak information about our target variable: we won't have these figures when predicting the global sales of a new game. Therefore, even though such a model would work on our current dataset, it would be of little use in the real world. Despite this, it would be interesting to see if the distribution of sales by region impacts the actual global sales number. Thus, we will express each region's sales as a percentage of global sales. This will allow us to investigate whether or not targeting specific regions can lead to an overall breakout in global sales.
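The series flags and regional percentages described above might be engineered along these lines. This is a minimal sketch on toy data: the column names, the substring matching on the game title, and the shortened list of series are all assumptions rather than the article's actual code.

```python
import pandas as pd

# Toy rows with made-up sales figures (in millions).
games = pd.DataFrame({
    "Name": ["Super Mario Land", "Pokemon Gold", "Halo 3", "Tetris"],
    "NA_Sales": [4.0, 3.0, 6.0, 2.0],
    "EU_Sales": [3.0, 3.0, 3.0, 1.0],
    "JP_Sales": [2.0, 3.0, 0.5, 1.5],
    "Other_Sales": [1.0, 1.0, 0.5, 0.5],
})
region_cols = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]
games["Global_Sales"] = games[region_cols].sum(axis=1)

# One boolean column per series; a game outside every series is False
# in all of them. Only three of the eight series are shown here.
for series in ["Mario", "Pokemon", "Halo"]:
    games[f"Series_{series}"] = games["Name"].str.contains(
        series, case=False, regex=False
    )

# Replace gross regional sales with each region's share of global sales.
for col in region_cols:
    games[col.replace("_Sales", "_Share")] = games[col] / games["Global_Sales"]
games = games.drop(columns=region_cols)

share_cols = [c for c in games.columns if c.endswith("_Share")]
print(games[["Name"] + share_cols].round(2))
```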
Summary of all features selected
Year
Genre
Publisher
Device
Series
Percentage of Global Sales from North America
Percentage of Global Sales from Japan
Percentage of Global Sales from Europe
Percentage of Global Sales from Other Regions
Let's Finally Start Model Building!
In order to start building a suitable model for our dataset, we have to consider different baseline models. Using Python's scikit-learn library, I selected 6 different baseline models and one voting regressor. The performance of each model can be compared using cross-validation scoring and taking the average of all scores. Cross-validation breaks our data up into 'n' different sets (we will use n=5) and uses 'n-1' sets to train the model, which is then tested on the remaining set by calculating a score. This is repeated n times until every set has had a chance to be the testing set. The model's overall performance is measured by taking the average of all the scores, which in our case will be the mean absolute error. The mean absolute error (MAE) provides a snapshot of the model's accuracy by averaging how far the model's predictions are from the actual values.
Here is a summary of our baseline model results (mean cross-validated MAE, in millions):
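A minimal sketch of this scoring procedure with scikit-learn is shown below. It uses synthetic regression data and only two of the baseline models; the dataset and settings are assumptions for illustration, not the article's setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the sales data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
}

cv_mae = {}
for name, model in models.items():
    # scikit-learn maximises scores, so MAE is reported as a negative number.
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = -scores.mean()  # flip the sign to get the usual MAE
    print(f"{name}: {cv_mae[name]:.3f}")
```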
1. Linear Regression: 0.670
2. Decision Tree Regressor: 0.569
3. Random Forest Regressor: 0.570
4. Bagging Regressor: 0.572
5. Extra Trees Regressor: 0.569
6. Gradient Boosting: 0.566
7. Voting Regressor: 0.561
As we can see, the voting regressor seems to have performed the best, as reflected by it having the lowest MAE (in millions) of all the models. A voting regressor makes use of multiple models, each of which 'votes' by computing the value of global sales using its own algorithm. The average of these values is then taken, and this is what the model outputs as its predicted value. The voting regressor we tested is made up of all the other models with the exception of the linear regression model. By being able to account for the advantages of each of these models, the voting regressor is potentially more reliable, especially in its application to unseen data in the future.
Despite this, I wanted to improve the performance of our overall final model. In order to do this, I have tuned these baseline models using a method of hyperparameter optimisation.
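The averaging behaviour can be checked directly with scikit-learn's `VotingRegressor`. The sketch below uses synthetic data and small, assumed settings for each model; it only demonstrates that the ensemble prediction equals the mean of the individual predictions when no weights are given.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the sales data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

voter = VotingRegressor(estimators=[
    ("tree", DecisionTreeRegressor(random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("bagging", BaggingRegressor(random_state=0)),
    ("extra", ExtraTreesRegressor(n_estimators=50, random_state=0)),
    ("boost", GradientBoostingRegressor(random_state=0)),
])
voter.fit(X, y)

# Each fitted sub-model 'votes' with its own prediction...
individual = np.array([est.predict(X[:5]) for est in voter.estimators_])
manual_average = individual.mean(axis=0)

# ...and the ensemble output is the unweighted mean of those votes.
ensemble_pred = voter.predict(X[:5])
print(np.allclose(ensemble_pred, manual_average))
```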
Model Tuning
Out of all the models in our Voting Regressor, the one that we can tune the most to achieve a significantly greater performance is the Random Forest Model.
There are countless parameters that can be changed in a Random Forest model. In order to save time, we ran a random search over these parameters. This randomly samples a specified number of Random Forest models with different parameter combinations and returns the best parameters out of all the iterations performed. We can then use these parameters to improve our current Random Forest model.
After performing hyperparameter optimisation on the Random Forest model, I was interested in tuning the weights of our voting regressor to see which produced the best results. Through the use of Grid Search, I found that the best weighting for our Voting Regressor was a weight of 1 for all models except for the Gradient Boosting model, which was given a 2.
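A random search of this kind can be sketched with scikit-learn's `RandomizedSearchCV`. The parameter ranges, number of iterations, and synthetic data below are assumptions chosen to keep the example quick; the article does not list the ranges it searched.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the sales data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Assumed search space; each iteration samples one combination from it.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [0.5, 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                 # number of random combinations to try
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated MAE:", -search.best_score_)
```

Only 5 of the 54 possible combinations are evaluated here, which is the trade-off a random search makes: less computation in exchange for no guarantee of finding the single best combination.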
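The weight tuning can be sketched as an exhaustive loop over a small grid of candidate weights, scoring each with cross-validation. This hand-rolled loop is an assumption about how such a search could be run, not the article's code, and it uses only three sub-models and weights of 1 or 2 to stay fast.

```python
from itertools import product

from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem standing in for the sales data.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

base = [
    ("tree", DecisionTreeRegressor(random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=30, random_state=0)),
    ("boost", GradientBoostingRegressor(random_state=0)),
]

# Try every combination of weights 1 or 2 across the three models.
results = {}
for weights in product([1, 2], repeat=len(base)):
    voter = VotingRegressor(estimators=base, weights=list(weights))
    scores = cross_val_score(
        voter, X, y, cv=3, scoring="neg_mean_absolute_error"
    )
    results[weights] = -scores.mean()

best_weights = min(results, key=results.get)
print("best weights (tree, forest, boost):", best_weights)
print("cross-validated MAE:", round(results[best_weights], 3))
```

Only the relative sizes of the weights matter, so (1, 1, 1) and (2, 2, 2) describe the same ensemble.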
Using our optimised Random Forest parameters and Voting Regressor weights, we will now go on to develop our final model.
The Final Model
In building our final model we will use a Voting Regressor that consists of the best Random Forest, Gradient Boosting, Decision Tree, Extra Trees, and Bagging Regressors, with weights favouring the Gradient Boosting model slightly more. In order to gain insight into our final model's key performance metrics, we will randomly set aside 1000 rows of data for testing (about 8% of the total data). The remainder of the data will be used in training the final model.
When this model is applied to the test data, it has a mean absolute error of 0.333. This is around a 40% decrease in error compared with our best baseline model (0.561), which stands testament to the importance and power of hyperparameter optimisation. Furthermore, we can also analyse the R² score for our model, which was 0.679. This means that 67.9% of the variance in global sales can be explained by the features in our model. The remainder may be due to other factors that our model does not capture. While we will be able to get a value for the global sales of a video game using this model, it is important to note that this value is just an estimate and we need to interpret it with caution. That being said, it would still be interesting to analyse the importance of each feature in this model.
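For reference, the two metrics can be written out in a few lines. The toy predictions below are made up; only the 0.561 and 0.333 figures in the final calculation come from the results reported in this article.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy actual and predicted sales (made-up values).
y_true = [0.1, 0.4, 1.0, 2.5]
y_pred = [0.2, 0.3, 1.4, 2.0]

print("MAE:", round(mae(y_true, y_pred), 3))
print("R^2:", round(r_squared(y_true, y_pred), 3))

# Relative drop in MAE from the best baseline (0.561) to the tuned model (0.333).
relative_drop = (0.561 - 0.333) / 0.561
print("relative decrease in MAE:", round(relative_drop, 3))
```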

We can identify the most important features of a video game using a method called permutation importance. This method randomly shuffles the values of one feature at a time and observes how much the model's performance degrades as a result; the larger the drop, the more the model relies on that feature. We can see that the distribution of a game's sales by region seems to play a huge role in the overall success of the product. This is particularly the case for regions such as Japan and North America. We can also see that Mario and Pokemon games are more likely to have higher global sales than games of other series. These key insights can inform future product choices for video game producers.
Although this model can predict global sales with decent accuracy, it is also important to address its limitations. Firstly, there are several factors affecting a video game's sales that could not possibly be accounted for in our model. For instance, the trends and hype surrounding a game are not reflected in our model. This can be seen in newer games such as Fortnite, which has been incredibly successful and has featured in countless celebrity livestreams even though it is not part of an established series. Despite these limitations, this model gives companies a good starting point for what to expect in terms of their global sales, which can help inform key planning decisions such as inventory of the game and expected traffic flow to online servers.
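Permutation importance is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below applies it to synthetic data in which only the first feature drives the target, so the expected ranking is known in advance; the data, feature names, and model settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on column 0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=400)
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the held-out data several times and record how
# much the score worsens on average.
result = permutation_importance(
    model, X_test, y_test,
    n_repeats=5, random_state=0, scoring="neg_mean_absolute_error",
)

for name, importance in zip(feature_names, result.importances_mean):
    print(f"{name}: {importance:.3f}")
```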
Key Insights
Overall, we have delved deep into one of the most successful entertainment industries today and come out with some key insights. We have discovered the value of 'classics' such as Pokemon and Mario in achieving large global sales. Using this insight, video game producers may find value in producing products that have a timeless, character-based element in them like these games. Furthermore, it seems that those who target their games towards Japan and North America are more likely to achieve greater total sales. This may also suggest that these regions are very influential in informing gaming trends around the world. As this ever-evolving industry develops, the next phase of gaming is still unclear, but whatever happens, it seems like its success depends largely on how it's received in the influential regions of Japan and North America.