Video Game Difficulty and Player Sentiment Analysis

Model Implementation

1. BERT Transformer Model with “review” as input and “voted_up” as output

The BERT model (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based language model designed to understand text context bidirectionally. We leveraged it for text classification by fine-tuning it on our dataset to predict whether a review was positive (voted_up).

We will first load featured_reviews.csv into a dataframe, taking the “review” column as our input X and “voted_up” as our output y. The data initially looks like this:

Image
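For reference, a minimal sketch of this loading and splitting step is shown below (the split ratio and random seed are assumptions not stated in the report):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the engineered review data
df = pd.read_csv("featured_reviews.csv")

X = df["review"]      # raw review text
y = df["voted_up"]    # 1 = positive review, 0 = negative review

# Assumed split parameters; the exact ratio and seed are not stated in the report.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```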

The BERT model first tokenizes the text data before training. Before tokenization, our data looked like this:

Image

The tokenized transformed data looks like this:

Image
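As an illustration, tokenization with the Hugging Face tokenizer might look like the sketch below (the bert-base-uncased checkpoint and the maximum sequence length are assumptions; X_train and X_test are the review splits from the previous sketch):

```python
from transformers import BertTokenizer

# Assumed checkpoint; the report does not state which BERT variant was used.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

train_encodings = tokenizer(
    list(X_train),          # raw review strings
    truncation=True,        # cut off reviews longer than max_length
    padding="max_length",   # pad shorter reviews to a fixed length
    max_length=128,         # assumed maximum sequence length
    return_tensors="pt",    # PyTorch tensors: input_ids, attention_mask
)
test_encodings = tokenizer(
    list(X_test), truncation=True, padding="max_length",
    max_length=128, return_tensors="pt",
)
```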

Our processing involves the following steps:

The model training involved the following:
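While the individual steps are detailed in the accompanying notebook, a minimal illustration of such a fine-tuning setup using the Hugging Face Trainer API might look like this (checkpoint, epoch count, batch size, and learning rate are assumptions; train_encodings and test_encodings come from the tokenization sketch above):

```python
import torch
from torch.utils.data import Dataset
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

class ReviewDataset(Dataset):
    """Wraps the tokenized encodings and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, list(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(int(self.labels[idx]))  # voted_up as 0/1
        return item

# Binary classification head on top of BERT: voted_up vs. not
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert_voted_up",
    num_train_epochs=2,              # assumed
    per_device_train_batch_size=16,  # assumed
    learning_rate=2e-5,              # assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ReviewDataset(train_encodings, y_train),
    eval_dataset=ReviewDataset(test_encodings, y_test),
)
trainer.train()
```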

After training the BERT model, we evaluated it on the test set. Below are the results:

Image

The model performs exceptionally well for positive reviews (class 1), which dominate the dataset. Performance for negative reviews (class 0) is weaker, with lower precision and recall. The overall weighted metrics indicate a robust model for predicting review positivity.

To try to improve on this, we will incorporate additional features in the next model.

2. BERT Transformer Model with “review”, “review_length” and “sentiment_score” as input and “voted_up” as output

We build another BERT transformer model, this time incorporating two numerical features alongside the review column. We take “review”, “review_length” and “sentiment_score” as our input X and “voted_up” as our output y. Our data looks like this:

Image

The numerical columns are normalized by min-max scaling. The data now looks like this:

Image

The review column is tokenized in the same manner as before and looks like this after transformation:

Image

The processing steps are similar to those of the previous model:

To incorporate both text and numerical features, a custom BERT-based model was designed:
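The exact architecture is defined in the notebook; a representative sketch of such a model, concatenating BERT's pooled text representation with the two scaled numeric features, might look like this (the checkpoint and classifier-head sizes are assumptions):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertWithNumericFeatures(nn.Module):
    """BERT text encoder whose pooled output is concatenated with numeric features."""
    def __init__(self, n_numeric=2, n_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # assumed checkpoint
        hidden = self.bert.config.hidden_size                        # 768 for bert-base
        self.classifier = nn.Sequential(                             # assumed head sizes
            nn.Linear(hidden + n_numeric, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, n_classes),
        )

    def forward(self, input_ids, attention_mask, numeric_feats):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        combined = torch.cat([out.pooler_output, numeric_feats], dim=1)
        return self.classifier(combined)   # logits over the two classes
```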

The model was trained similarly to the previous model:

After training this new BERT model, we evaluated it on the test set. Below are the results:

Image

Adding the extra features (review_length and sentiment_score) did not improve the model's performance. While there was a minor gain in precision for negative reviews, the accompanying drop in recall meant no significant improvement in the overall metrics.

3. LightGBM Classifier Model with “review”, “review_length” and “sentiment_score” as input and “voted_up” as output

We will now use a LightGBM classifier model to try to model the same relationship. The data looks like this again:

Image

The numerical columns are again normalized and the data looks like this:

Image

The review text is transformed so it can be used by our classifier model: raw review text is converted into numerical features using TF-IDF (Term Frequency–Inverse Document Frequency), allowing the text to be represented numerically for machine learning models. The steps are:
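A sketch of this step, combining the TF-IDF matrix with the two scaled numeric columns, is shown below (the vocabulary size is an assumption, and X_train/X_test are assumed here to be dataframes holding the three input columns from a train/test split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack, csr_matrix

# Assumed vocabulary size; the exact TF-IDF settings are in the notebook.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")

X_train_text = tfidf.fit_transform(X_train["review"])
X_test_text = tfidf.transform(X_test["review"])

# Append the scaled numeric columns to the sparse TF-IDF matrix
num_cols = ["review_length", "sentiment_score"]
X_train_all = hstack([X_train_text, csr_matrix(X_train[num_cols].values)]).tocsr()
X_test_all = hstack([X_test_text, csr_matrix(X_test[num_cols].values)]).tocsr()
```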

The transformed data looks like this:

Image

We used LightGBM (LGBMClassifier), a gradient boosting framework that is highly efficient for handling large datasets and sparse matrices (like our TF-IDF + numerical features). The model was initialized with the parameter class_weight="balanced" to handle class imbalance between positive (voted_up=1) and negative (voted_up=0) reviews.

The model was trained on the combined feature set of the TF-IDF review vectors together with the scaled review_length and sentiment_score columns.

Hyperparameter Tuning: To improve the model’s performance, we conducted a Grid Search over a reduced parameter space to find the best combination of hyperparameters. The following parameters were tuned:

We used GridSearchCV with 3-fold cross-validation and optimized for the F1-score, balancing precision and recall.
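A minimal sketch of this search (the grid shown is hypothetical; the actual reduced search space is listed in the notebook, and X_train_all/y_train come from the TF-IDF step above):

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

lgbm = LGBMClassifier(class_weight="balanced", random_state=42)

# Hypothetical grid; the actual reduced search space is in the notebook.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "num_leaves": [31, 63],
}

search = GridSearchCV(lgbm, param_grid, cv=3, scoring="f1", n_jobs=-1)
search.fit(X_train_all, y_train)
best_lgbm = search.best_estimator_
```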

Threshold Adjustment: For the best model from the grid search, we adjusted the decision threshold (default 0.5) to optimize performance for different class priorities. The steps were the following:

Lowering the threshold improved recall for class 0 (negative reviews), reducing false negatives, while increasing it improved precision for class 1 (positive reviews), reducing false positives. To balance the two, a threshold of 0.3 was chosen.
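A sketch of the threshold sweep, using macro-F1 as one possible selection criterion (the actual criterion used in the notebook may differ):

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

proba = best_lgbm.predict_proba(X_test_all)[:, 1]   # P(voted_up = 1)
y_true = y_test.astype(int)

# Inspect a range of thresholds before settling on 0.3
for t in np.arange(0.1, 1.0, 0.1):
    preds = (proba >= t).astype(int)
    print(f"threshold={t:.1f}  macro-F1={f1_score(y_true, preds, average='macro'):.3f}")

y_pred = (proba >= 0.3).astype(int)                 # chosen threshold
print(classification_report(y_true, y_pred))
```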

The results before hyperparameter tuning and threshold adjustment are the following:

Image

After tuning and threshold adjustment, the results were the following:

Image

After hyperparameter tuning and threshold adjustment, the LightGBM model showed improvements in balancing precision and recall for Class 0 (negative reviews). Precision and recall for Class 0 became 53% and 54%, raising the F1-score from 52% to 54%. Class 1 (positive reviews) maintained strong performance, with an F1-score of 95%, and the overall accuracy increased from 86% to 91%. While the model successfully enhanced detection of negative reviews, the F1-score for Class 0 remains moderate due to class imbalance. The macro average F1-score improved to 74%. These results indicate that the tuning effectively balanced performance between classes, though further techniques might be required to address residual challenges in negative review detection.

4. Ensemble Stacking Model using LightGBM, Logistic Regression and Naïve Bayes with “review”, “review_length” and “sentiment_score” as input and “voted_up” as output

We now try an ensemble model for the same relationship. The dataset is the same and looks like this after scaling:

Image

The review text is transformed in the same way as for the LightGBM model and again looks like this:

Image

Ensemble models combine predictions from multiple base models to improve overall performance by leveraging the strengths of each model. We implemented a stacking ensemble model, which trains several base models and then uses a meta-model to combine their predictions.

Our base models were LightGBM, Logistic Regression, and Naïve Bayes. We used Logistic Regression as the meta-model, which learns how to combine predictions from the base models. Cross-validation (cv=3) ensures that the meta-model is trained on out-of-sample predictions.

Each base model was trained on the training set. Predictions from these models were combined into new features for the meta-model. The meta-model was trained on the combined predictions to make the final decision.
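A sketch of this stacking setup with scikit-learn is shown below. The Naïve Bayes variant is not specified; MultinomialNB is assumed here because the TF-IDF and min-max-scaled inputs are non-negative, and Logistic Regression appears both as a base model and as the meta-model, mirroring the description above:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from lightgbm import LGBMClassifier

base_models = [
    ("lgbm", LGBMClassifier(class_weight="balanced", random_state=42)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("nb", MultinomialNB()),   # assumed NB variant (inputs are non-negative)
]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model
    cv=3,        # meta-model is trained on out-of-fold base-model predictions
    n_jobs=-1,
)
stack.fit(X_train_all, y_train)
```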

Hyperparameter Tuning: To further optimize the stacking model, we performed hyperparameter tuning using grid search. The grid search used 3-fold cross-validation and was optimized for the weighted F1-score. The following parameters were tuned:

We also swept the decision threshold from 0.1 to 0.9; a value of 0.7 gave the best results, so we chose it.

Before hyperparameter tuning and threshold adjustment we got the following results:

Image

After tuning and choosing a threshold of 0.7, we got the following results:

Image

The stacking ensemble model initially performed decently, achieving a macro F1-score of 79% and high precision and recall for Class 1 (positive reviews). However, Class 0 (negative reviews) struggled with a low recall of 51% and an F1-score of 61%, indicating room for improvement in identifying negative reviews. After hyperparameter tuning and adjusting the decision threshold to 0.7, the model showed a better balance for Class 0, with precision falling from 75% to 65% while recall increased from 51% to 62%. Although the F1-score for Class 0 rose only marginally to 63%, performance for Class 1 remained robust, with precision and recall of 96%. Overall accuracy remained consistent at 93%, and the macro-average F1-score improved slightly, indicating that threshold adjustment effectively enhanced the model's ability to identify negative reviews without sacrificing performance on positive reviews.

5. Bagging Model using LightGBM Classifier with “review”, “review_length” and “sentiment_score” as input and “voted_up” as output

We build an ensemble model by bagging the LightGBM classifier to model the same relationship. The dataset is the same and looks like this after scaling:

Image

The review text is transformed in the same way as for the previous ensemble model and again looks like this:

Image

Bagging (Bootstrap Aggregating) combines multiple instances of a base model to reduce variance and improve stability. Each base model is trained on a random subset of the data (with replacement), and predictions are aggregated to produce the final result.

We implemented a BaggingClassifier using LightGBM as the base estimator. It has the following key parameters:
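A sketch of this setup is shown below (the parameter values are placeholders, and the estimator keyword assumes scikit-learn ≥ 1.2; older versions call it base_estimator; X_train_all/y_train come from the TF-IDF step):

```python
from sklearn.ensemble import BaggingClassifier
from lightgbm import LGBMClassifier

bagging = BaggingClassifier(
    estimator=LGBMClassifier(class_weight="balanced", random_state=42),  # base model
    n_estimators=10,    # number of bootstrap-trained LightGBM models (placeholder)
    max_samples=0.8,    # fraction of rows drawn with replacement per model (placeholder)
    max_features=0.8,   # fraction of features drawn per model (placeholder)
    n_jobs=-1,
    random_state=42,
)
bagging.fit(X_train_all, y_train)
```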

To optimize the BaggingClassifier, we performed hyperparameter tuning using grid search on a small search space to maintain efficiency. The grid search used 3-fold cross-validation and was optimized for the weighted F1-score. All three parameters were tuned:

After training the BaggingClassifier with the best parameters, we again swept the decision threshold from 0.1 to 0.9; a value of 0.4 gave the best results, so we chose it.

Before hyperparameter tuning and threshold adjustment we got the following results:

Image

After tuning and choosing a threshold of 0.4, we got the following results:

Image

The BaggingClassifier model initially performed reasonably well, achieving an accuracy of 88% and a macro average F1-score of 74%. However, Class 0 (negative reviews) had a low F1-score of 55% due to its relatively low precision (43%) and moderate recall (74%). After hyperparameter tuning and adjusting the threshold to 0.4, the model showed an improvement in balancing performance for Class 0. Precision for Class 0 increased to 55%, while recall decreased to 58%, resulting in a better F1-score of 57%. The performance for Class 1 (positive reviews) remained strong, with a precision and recall of 95%. The overall accuracy increased to 91%, and the macro-average F1-score rose to 76%, reflecting a better balance between the two classes.

6. LightGBM Model with “review_length”, “sentiment_score” as input and “mentions_difficulty” as output

We will now try to predict whether a review mentions difficulty using review metadata, modelling review_length and sentiment_score against mentions_difficulty.

This model aimed to predict whether a review mentions difficulty (mentions_difficulty) using two features: review_length and sentiment_score. Given the significant imbalance in the target variable, with far fewer reviews mentioning difficulty, a combination of under-sampling and over-sampling techniques was used to create a balanced training dataset.

Our input dataset initially looks like this:

Image

Both numerical columns are scaled and now look like this:

Image

The output classes are imbalanced. They look like this:

Image

We use both undersampling and oversampling and the data now looks like this:

Image
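A sketch of this resampling step with imbalanced-learn (the specific samplers and ratios are assumptions; SMOTE is used explicitly in the later models, so it is assumed here as well, and X_train_scaled/y_train denote the scaled training split):

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Assumed ratios: first trim the majority class, then synthesize minority samples.
under = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_mid, y_mid = under.fit_resample(X_train_scaled, y_train)

over = SMOTE(random_state=42)          # balances the classes 1:1 by default
X_res, y_res = over.fit_resample(X_mid, y_mid)

print(y_res.value_counts())            # verify the resampled class counts
```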

The LGBMClassifier was used for training, leveraging its ability to handle class imbalance via the class_weight='balanced' parameter. Hyperparameter tuning was conducted using GridSearchCV with 3-fold cross-validation, optimizing the following parameters with respect to the F1-score:

After tuning, we also swept the decision threshold from 0.1 to 0.9; a value of 0.7 gave the best results.

Before hyperparameter tuning and threshold adjustment we got the following results:

Image

After tuning and choosing a threshold of 0.7, we got the following results:

Image

After hyperparameter tuning and threshold adjustment, the LightGBM model demonstrated improved balance between the two classes. Initially, the model achieved an accuracy of 76% and a macro average F1-score of 66%, but struggled to identify Class 1 (minority class), with a low precision of 34% and a moderate recall of 76%, resulting in an F1-score of 47%. By tuning hyperparameters and adjusting the decision threshold to 0.7, the model significantly improved its performance for Class 1, with precision increasing to 52%. However, recall for Class 1 dropped to 48%, resulting in a slightly better F1-score of 50%. For Class 0, both precision and recall improved to 92% and 93%, respectively, ensuring robust performance for the majority class. The overall accuracy rose to 87%, with a macro average F1-score of 71%, reflecting a better balance between classes while maintaining high accuracy.

7. XGBoost Classifier Model with “review_length”, “sentiment_score” as input and “mentions_difficulty” as output

We will now use an XGBoost classifier to model the same relationship as the previous one. This model aims to predict whether a review mentions difficulty (mentions_difficulty) from review_length and sentiment_score. To handle the class imbalance, a combination of under-sampling and SMOTE was applied to create a balanced training dataset, followed by training an XGBClassifier optimized through hyperparameter tuning.

The input dataset looks like this again:

Image

After being normalized it looks like this:

Image

The output classes are imbalanced. They look like this:

Image

We use both undersampling and oversampling and the data now looks like this:

Image

An XGBClassifier was used for training. The following default parameters were applied:
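A sketch of this model (the values shown are the library defaults or common settings and are assumptions where the report does not state them; X_res/y_res denote the resampled review_length and sentiment_score training data):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,        # library default
    learning_rate=0.3,       # library default
    max_depth=6,             # library default
    eval_metric="logloss",   # explicit metric to avoid the default-metric warning
    random_state=42,
)
xgb.fit(X_res, y_res)        # resampled training data from the previous step
```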

Hyperparameter tuning was conducted using GridSearchCV with 3-fold cross-validation, and all parameters were optimized with respect to the F1-score:

After tuning, we also swept the decision threshold from 0.1 to 0.9; a value of 0.8 gave the best results.

Before hyperparameter tuning and threshold adjustment we got the following results:

Image

After tuning and choosing a threshold of 0.8, we got the following results:

Image

Initially, the XGBoost model performed moderately well for Class 0, with a precision of 96% and an F1-score of 77%. However, it struggled with Class 1, achieving only 27% precision despite a recall of 84%, resulting in a low F1-score of 41%. The overall accuracy was 67%, indicating difficulty in balancing the two classes. After hyperparameter tuning and threshold adjustment (threshold = 0.8), the model showed improved balance. Class 1 precision increased to 48%, with recall at 51%, leading to a slightly better F1-score of 49%. Class 0 maintained stable performance, with a precision of 92% and recall of 91%, resulting in an F1-score of 92%. The overall accuracy rose to 86%, and the macro average F1-score improved to 70%, reflecting better handling of the minority class without sacrificing much performance for the majority class.

8. Ensemble Stacking Model using XGBoost Classifier, Logistic Regression and Naïve Bayes with “review_length”, “sentiment_score” as input and “mentions_difficulty” as output

We will now use an ensemble model of XGBoost, Logistic Regression and Naïve Bayes to model the same relationship. This model combines the predictions from multiple base models using a stacking ensemble to predict whether a review mentions difficulty (mentions_difficulty). The stacking approach leverages diverse models to improve overall performance.

The input dataset looks like this again:

Image

After being normalized it looks like this:

Image

The output classes are imbalanced. They look like this:

Image

We use both undersampling and oversampling and the data now looks like this:

Image

Our base models were XGBoost, Logistic Regression, and Naïve Bayes.

Logistic Regression was used as the meta-model to learn how to combine predictions from the base models. 3-fold cross-validation (cv=3) ensured robust training of the meta-model.

Hyperparameter tuning was conducted using GridSearchCV to optimize the stacking model:
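A sketch of tuning such a stacking model with GridSearchCV, where nested base-model and meta-model parameters are addressed with the <name>__<parameter> syntax (the grid and the GaussianNB choice are assumptions, and X_res/y_res denote the resampled training data):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),   # assumed NB variant for the two dense numeric features
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Hypothetical grid; base-model and meta-model parameters use <name>__<param> keys.
param_grid = {
    "xgb__n_estimators": [100, 200],
    "xgb__max_depth": [3, 6],
    "final_estimator__C": [0.1, 1, 10],
}

search = GridSearchCV(stack, param_grid, cv=3, scoring="f1", n_jobs=-1)
search.fit(X_res, y_res)
best_stack = search.best_estimator_
```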

After tuning, we also swept the decision threshold from 0.1 to 0.9; a value of 0.7 gave the best results.

Before hyperparameter tuning and threshold adjustment we got the following results:

Image

After tuning and choosing a threshold of 0.7, we got the following results:

Image

Initially, the model achieved decent results for Class 0, with an F1-score of 84%, but struggled significantly with Class 1, where the F1-score was only 46% due to a low precision of 33%. The overall accuracy of 75% highlighted the model's difficulty in handling the minority class, despite reasonable performance for the majority class. After hyperparameter tuning and adjusting the threshold to 0.7, the model's balance between the two classes improved. For Class 1, precision increased to 46% and recall was 53%, giving a slightly better F1-score of 49%. Class 0 maintained strong performance, with an F1-score of 91%. The overall accuracy increased to 85%, and the macro average F1-score improved to 70%, showing better class balance while preserving good performance for the dominant class.

9. Ensemble Stacking Model using LightGBM, Logistic Regression and Naïve Bayes with “review_length”, “sentiment_score” as input and “mentions_difficulty” as output

We will now use an ensemble model of LightGBM, Logistic Regression and Naïve Bayes to model the same relationship. This stacking ensemble model combines the predictions of three base models (LightGBM, Logistic Regression, and Gaussian Naive Bayes) to predict whether a review mentions difficulty (mentions_difficulty). The final meta-model uses Logistic Regression to combine the predictions from the base models for better overall performance.

The input dataset looks like this again:

Image

After being normalized it looks like this:

Image

The output classes are imbalanced. They look like this:

Image

We use both undersampling and oversampling and the data now looks like this:

Image

Our base models were LightGBM, Logistic Regression, and Gaussian Naive Bayes.

Logistic Regression combines the predictions of the base models, using 3-fold cross-validation (cv=3) to ensure robust training.

Hyperparameter tuning was conducted using GridSearchCV to optimize the stacking model:

After tuning, we also swept the decision threshold from 0.1 to 0.9; a value of 0.8 gave the best results.

Before hyperparameter tuning and threshold adjustment we got the following results:

Image

After tuning and choosing a threshold of 0.8, we got the following results:

Image

Initially, the model showed decent performance for Class 0, with an F1-score of 84%, but struggled significantly with Class 1, achieving only 46% due to low precision (33%). The overall accuracy of 76% and a macro average F1-score of 65% indicated a clear imbalance in handling the two classes, with the minority class being poorly identified. After hyperparameter tuning and adjusting the threshold to 0.8, the model achieved better balance. Precision for Class 1 improved significantly to 58%, although recall decreased to 43%, resulting in a slightly improved F1-score of 49%. Class 0 maintained robust performance, with an F1-score of 93% due to high precision (91%) and recall (95%). The overall accuracy increased to 88%, and the macro F1-score rose to 70%.

10. Decision Tree, Random Forest and Logistic Regression Models with genre data as input and "mentions_difficulty" as output

For the next models we will try to predict the relationship between genres and mentions of difficulty. We will use all our one-hot encoded genre columns ('roguelike', 'co_op', 'base_building', 'soulslike', 'deckbuilding', 'puzzle', 'metroidvania', 'rpg', 'competitive', 'first_person', 'crpg', 'multiplayer', 'action', 'sandbox', 'fantasy', 'simulation', 'platformer', 'shooter', 'open_world', 'strategy', 'survival', 'adventure', 'crafting', 'third_person', 'turn_based', '2d') as input and "mentions_difficulty" as output.

Our input data looks like this:

Image

The output class is imbalanced and the counts look like this:

Image

We use SMOTE oversampling. The new counts look like this:

Image

We first implemented a Decision Tree model and then a Random Forest model, both with the parameter class_weight="balanced" and otherwise default parameters. Next, we implemented a Logistic Regression model with max_iter=1000. All three models gave identical results, which look like this:

Image
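A sketch of these three baseline models (X_res/y_res denote the SMOTE-resampled one-hot genre features and labels, and X_test/y_test a held-out split; the split itself is not shown here):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

models = {
    "decision_tree": DecisionTreeClassifier(class_weight="balanced", random_state=42),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_res, y_res)             # SMOTE-resampled one-hot genre features
    preds = model.predict(X_test)
    print(name)
    print(classification_report(y_test, preds))
```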

Hyperparameter tuning was done for all 3 of them:

  • Decision Tree:
    • criterion: ['gini', 'entropy']
    • max_depth: [5, 10, 15, None]
    • min_samples_split: [2, 5, 10]
    • min_samples_leaf: [1, 5]
  • Random Forest
    • n_estimators: [100, 200]
    • max_depth: [5, 10, 15, None]
    • min_samples_split: [2, 5, 10]
    • min_samples_leaf: [1, 5]
    • max_features: ['sqrt', 'log2']
  • Logistic Regression:
    • C: [0.01, 0.1, 1, 10, 100]
    • penalty: ['l1','l2']
    • solver: ['liblinear', 'saga']

Even after tuning, all three models yield almost the same results as before. These look like this:

Image

Since all three models produce nearly identical results, even after tuning, we can conclude that the genre data does not provide strong enough evidence to model this relationship.

For a detailed walkthrough of our model implementation process, you can view our full Jupyter notebook here.