Video Game Difficulty and Player Sentiment Analysis

Conclusion

Project Summary

Our project aimed to explore how players perceive and discuss video game difficulty through their reviews. Using a dataset of 43,000 reviews from 30 diverse games, we analyzed the relationship between game difficulty and player sentiment across genres, metadata, and review content. Our research questions were addressed using a mix of machine learning models, statistical analyses, and visualizations.

Key Findings

Review Sentiment:

Positive reviews were predicted with high accuracy (93%) and an F1 score of 96%, though identifying negative reviews remained challenging.

Genres and Difficulty Mentions:

Attempts to predict difficulty mentions from game genres performed poorly, with models achieving only 57% accuracy and an F1 score of 30% for the difficulty-mention class. This suggests a weak relationship between genre and difficulty mentions. "Co-op" and "metroidvania" games had the highest proportion of reviews mentioning difficulty (above 18%), while "competitive" and "multiplayer" games had the lowest (7-8%). These mentions included discussions of both "hard" and "easy" difficulty.
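As a minimal sketch of how such per-genre proportions can be computed, a simple keyword match over review text works well enough for illustration. The keyword list, column names, and sample rows below are assumptions for demonstration, not the project's actual schema or data:

```python
import pandas as pd

# Hypothetical keyword list covering both "hard" and "easy" mentions.
DIFFICULTY_TERMS = ("hard", "difficult", "easy", "challenging", "brutal")

# Toy reviews; column names "genre" and "review_text" are assumed.
reviews = pd.DataFrame({
    "genre": ["metroidvania", "metroidvania", "competitive", "co-op"],
    "review_text": [
        "Brutally hard bosses but fair",
        "Great exploration and music",
        "Fun ranked matches",
        "Way too easy with a full squad",
    ],
})

# Flag reviews containing any difficulty-related keyword.
reviews["mentions_difficulty"] = (
    reviews["review_text"]
    .str.lower()
    .str.contains("|".join(DIFFICULTY_TERMS))
)

# Share of difficulty-mentioning reviews per genre.
share_by_genre = reviews.groupby("genre")["mentions_difficulty"].mean()
print(share_by_genre)
```

A production version would need word-boundary matching and negation handling; plain substring search over-counts (e.g. "hardware" matches "hard").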

Player Experience and Sentiment:

Beginner players gave lower sentiment scores compared to intermediate and experienced players, with differences confirmed to be statistically significant by our tests. Players with shorter playtimes (less than 5 hours) were more likely to mention difficulty, with mentions decreasing as playtime increased.
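One way to confirm such group differences statistically is a Kruskal-Wallis test, which compares distributions across more than two groups without assuming normality. This is a hedged sketch of that kind of test; the sentiment scores below are invented for illustration, and the project's actual test may have differed:

```python
from scipy.stats import kruskal

# Made-up sentiment scores for three experience groups.
beginner = [0.2, 0.3, 0.1, 0.4, 0.25]
intermediate = [0.6, 0.7, 0.55, 0.65, 0.6]
experienced = [0.8, 0.75, 0.9, 0.85, 0.7]

# Kruskal-Wallis H-test: null hypothesis is that all groups
# come from the same distribution.
stat, p_value = kruskal(beginner, intermediate, experienced)
print(f"H = {stat:.2f}, p = {p_value:.4f}")
```

A p-value below the chosen significance level (commonly 0.05) indicates the sentiment differences across groups are unlikely to be due to chance.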

Summary of Modeling Results

Accuracy of the Models

[Figure: Accuracy of the models]

This chart highlights the overall accuracy of the models used in the project. The highest accuracy (93%) was achieved by several models, including the BERT-based models and ensemble stacking models for predicting whether a review is positive. However, the genre-based models for predicting difficulty mentions struggled, with an accuracy of just 57%, underscoring the challenge of modeling this relationship.
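The stacking idea mentioned above can be sketched with scikit-learn's StackingClassifier: base learners make predictions, and a meta-learner combines them. The estimators and synthetic data here are illustrative assumptions, not the project's exact pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic binary-classification data standing in for review features.
X, y = make_classification(n_samples=300, random_state=0)

# Two base learners feed a logistic-regression meta-learner.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", LinearSVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X, y)
print(stack.score(X, y))
```

The meta-learner is trained on cross-validated predictions of the base models, which is what lets stacking outperform any single base learner when their errors are uncorrelated.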

Class-Specific and Macro-Average F1 Scores

[Figure: Class-specific and macro-average F1 scores by model]

The grouped bar chart provides a deeper look at the models' performance for individual classes (Class 0: Negative/No Difficulty Mention; Class 1: Positive/Difficulty Mention) and their macro-average F1 scores. While the majority class showed strong F1 scores (up to 96%), the minority class often lagged behind, with F1 scores as low as 30% for some tasks. This discrepancy highlights the impact of class imbalance and the need for more nuanced modeling approaches.
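The gap between per-class and macro-average F1 is easy to reproduce on a toy example. In this illustrative sketch (not the project's data), a classifier that nearly always predicts the majority class scores well on that class but drags down the macro average:

```python
from sklearn.metrics import f1_score

# 8 majority-class (1) and 2 minority-class (0) examples;
# the classifier misses one of the two minority examples.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

per_class = f1_score(y_true, y_pred, average=None)  # one score per class
macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean
print(per_class, macro)
```

Because the macro average weights both classes equally regardless of support, it exposes minority-class weakness that plain accuracy (here 90%) hides.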

Addressing Research Questions

Significance of Findings

For Developers:

These findings provide actionable insights to tailor game difficulty to target audiences. Developers can design difficulty settings or onboarding processes that cater to both beginner and experienced players.

For Players:

Players can use this analysis to identify games suited to their preferences and skill levels, avoiding potential frustration or finding rewarding challenges.

For Platforms:

Gaming platforms can implement smarter recommendation systems that consider player sentiment and difficulty perception.

Potential Use Cases

Game Balancing:

Developers can adjust difficulty settings or add dynamic difficulty adjustments based on player feedback trends.

Player Insights:

Gaming platforms can use these findings to recommend games that align with individual preferences.

Community Engagement:

Insights about difficulty-related frustrations can help developers address player concerns in updates or sequels.

Limitations and Future Improvements

In-Game Metrics:

Our analysis relied heavily on metadata such as playtime and the number of games owned, which do not fully capture a player's in-game behavior or progression. Including additional in-game metrics, such as achievement progression, difficulty settings, or completion rates, could help us better understand how players perceive and respond to game difficulty over time.

Modeling Challenges:

The models faced difficulties with imbalanced data, leading to weaker performance for underrepresented classes such as negative reviews or difficulty mentions. Simpler models like Decision Trees also failed to capture meaningful relationships, particularly between genres and difficulty mentions. Future work could address these issues using more advanced data balancing techniques, gaming-specific fine-tuning for NLP models, and hybrid approaches that combine advanced NLP with rule-based or ensemble methods.
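One of the simpler balancing techniques alluded to above is class weighting, which raises the training cost of minority-class errors. This sketch uses scikit-learn's built-in `class_weight="balanced"` option on synthetic imbalanced data; it is one option among several, not the project's chosen fix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with a roughly 9:1 class imbalance,
# mimicking the skew between positive and negative reviews.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight="balanced" reweights the loss inversely to class
# frequency, so minority-class mistakes cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
print(clf.score(X, y))
```

Resampling methods (over-sampling the minority class or under-sampling the majority) are the main alternative; class weighting has the advantage of leaving the data untouched.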

Player Metadata and Experience:

Models using player metadata, such as playtime and number of games owned, showed moderate success in predicting difficulty mentions but struggled to consistently capture nuanced patterns. Adding features such as solo versus co-op gameplay styles or player-specific behaviors could enhance the predictive power of metadata-based models and provide a more complete picture of player preferences.