
Machine Learning Term Project

Task: To predict ratings using the BoardGamesGeek-Reviews dataset
Author: Rohit Singh (ID: 1001718350)

Introduction:

Below are the types of classification algorithms in machine learning that we used to implement our project:

  • Logistic Regression

  • Naive Bayes Classifier

  • Support Vector Machines

  • Random Forest

Logistic Regression (Predictive Learning Model): A statistical method for analyzing a data set in which one or more independent variables determine an outcome. The outcome is measured with a dichotomous variable (one with only two possible values). The goal of logistic regression is to find the best-fitting model to describe the relationship between the dichotomous characteristic of interest (the dependent, response, or outcome variable) and a set of independent (predictor or explanatory) variables. This is better than other binary classification methods such as nearest neighbors, since it also explains quantitatively the factors that lead to a classification.

Naive Bayes Classifier (Generative Learning Model): A classification technique based on Bayes' Theorem with an assumption of independence among predictors. In other words, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature, or that all of these properties contribute independently to the probability. This family of classifiers is relatively easy to build and particularly useful for very large data sets, as it is highly scalable. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.

Support-Vector Machines: Supervised learning models with associated learning algorithms that analyze the data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm creates a model that assigns new examples to one category or the other.

Random Forest: Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
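
For concreteness, here is a minimal sketch of how these four classifiers might be instantiated with scikit-learn; the parameters shown are illustrative, not the exact settings used in this project:

    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import RandomForestClassifier

    # Illustrative instantiations; the project's actual hyperparameters may differ.
    classifiers = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": MultinomialNB(),    # well suited to count/TF-IDF text features
        "Linear SVM": LinearSVC(),         # linear support-vector classifier
        "Random Forest": RandomForestClassifier(n_estimators=100),
    }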




Preprocessing Data: Feature Selection

Select only the features that are most relevant to the predictive variable or performance you're interested in. With irrelevant features in your data, the model's accuracy can drop drastically. So, in order to do that, we:

  • Purge all rows containing null values in the "comment" column (see the sketch below)
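
A minimal sketch of this step with pandas, assuming the reviews are loaded into a DataFrame with a "comment" column (the file name is an assumption for illustration):

    import pandas as pd

    # Load the BoardGameGeek reviews (file name assumed for illustration).
    df = pd.read_csv("bgg-13m-reviews.csv")

    # Purge all rows whose "comment" column is null, keeping only text reviews.
    df = df.dropna(subset=["comment"]).reset_index(drop=True)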


Checking for imbalance in the dataset with respect to the number of ratings

  • It is always wise to sample out unbiased data before training; a quick check of the class counts is sketched below
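
One simple way to inspect the class balance (a sketch, assuming a discrete "rating" column):

    # Count how many reviews fall under each rating class.
    print(df["rating"].value_counts().sort_index())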


Downsampling the imbalanced data

Since we have only about 20,000 data samples for the user rating 1.0 in the whole dataset, we took a random sample of 20k from every other rating class to eliminate the bias, which could otherwise negatively impact the overall accuracy of the predicted ratings.
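
A sketch of this downsampling step, drawing 20,000 random reviews per rating class (the random seed is illustrative; requires pandas >= 1.1):

    # Draw 20,000 samples from each rating class so every class is equally represented.
    balanced = (
        df.groupby("rating")
          .sample(n=20000, random_state=42)
          .reset_index(drop=True)
    )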


Contribution 1: Clearing jargon words from the text for better accuracy

This was one of the most significant contributions; it helped estimate the relevant words for modeling and predict much more accurate ratings.
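
A hedged sketch of such a cleaning step; the project's exact jargon list is not given, so this version just lowercases, strips non-letter characters, and drops common English stopwords and very short tokens:

    import re
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def clean_text(text):
        """Lowercase, strip non-letters, and remove stopword/jargon-like tokens."""
        text = re.sub(r"[^a-z\s]", " ", text.lower())
        tokens = [t for t in text.split()
                  if t not in ENGLISH_STOP_WORDS and len(t) > 2]
        return " ".join(tokens)

    balanced["comment"] = balanced["comment"].apply(clean_text)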


Conventional Approach

  • By comparison, in the conventional approach the pretrained network's dense layers are replaced with a traditional classifier. The convolutional base's output is fed directly to the classifier, which then focuses on the extracted features to arrive at a definitive result.

Baseline Models

  • A list of transforms and a final estimator are applied in sequence. The pipeline's intermediate steps must be 'transforms,' that is, they must implement fit and transform methods. The final estimator only needs to implement fit. A baseline pipeline is sketched below.
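
For example, a baseline might chain a TF-IDF transform into a final estimator (a sketch; the project's exact baselines and split parameters may differ):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        balanced["comment"], balanced["rating"], test_size=0.2, random_state=42)

    # Intermediate step is a transform (fit/transform); the final step only needs fit.
    baseline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", MultinomialNB()),
    ])
    baseline.fit(X_train, y_train)
    print("baseline accuracy:", baseline.score(X_test, y_test))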

Pipelines for Hyperparameter tuning

  • Pipelines offer a simple and intuitive way to organize our ML flows, which are characterized by a consistent sequence of core tasks including pre-processing, feature selection, standardization/normalization, and ranking.

  • A Pipeline automates repeated fit/transform cycles by successively calling fit on each estimator, applying transform to the input, and passing the output (e.g., from the TfidfVectorizer) on to the next step.

In this experiment the models are tested with different numbers of features, and the best-fitting hyperparameters are chosen for the test dataset, as sketched below.
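
A sketch of how such a search over the number of features might look with RandomizedSearchCV; the candidate values are illustrative, and the split from the baseline sketch above is reused:

    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", LinearSVC()),
    ])

    # Try different feature-pruning levels; these values are illustrative.
    param_dist = {"tfidf__max_features": [1000, 10000, 100000, 500000]}
    search = RandomizedSearchCV(pipe, param_dist, n_iter=4, cv=3, random_state=42)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)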


Triune Pipelining

  • Construction of the so-called Triune Pipeline, whose component pipelines convert raw text into nothing less than the three key building blocks of NLP tasks, combined with varied numbers of features.

  • This feature selection technique works exceptionally well when dealing with an enormous dataset that involves repeated training to achieve the desired output

    • In my case, I trained on the cleaned dataset with varied numbers of features for pruning


Each component pipeline includes a transformer that outputs a major feature type/representation in NLP. I also showed that, particularly in conjunction with RandomizedSearchCV, we can combine the distinct feature sets resulting from different pipelines to achieve better results; a minimal sketch follows. Another good takeaway, I think, is the value of combining ML-driven and rule-based methods to boost model performance.
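
A minimal sketch of such a combined pipeline using scikit-learn's FeatureUnion. The three representations shown here (word TF-IDF, character n-gram TF-IDF, and raw n-gram counts) are assumptions for illustration; the original post does not spell out the Triune Pipeline's exact components:

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.svm import LinearSVC

    # Three component pipelines, each emitting one major NLP feature representation.
    triune_features = FeatureUnion([
        ("word_tfidf", TfidfVectorizer(analyzer="word")),
        ("char_tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ("ngram_counts", CountVectorizer(ngram_range=(1, 2))),
    ])

    triune = Pipeline([
        ("features", triune_features),
        ("clf", LinearSVC()),
    ])
    triune.fit(X_train, y_train)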


The Linear SVC model outperforms all other classifiers, with an accuracy of 83.2%.

Optimal hyperparameter: max_features = 500,000, which is larger than the optimum for MNB, MNB with n-grams, or any other classifier we tested.


Findings

After performing unbiased sampling and optimizing several features of the classifier, the output we obtained was much more accurate and close to what we as users might rate in the real world. Below is an example demonstrating how our model works.
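
For instance, with the best-performing configuration (TF-IDF with max_features = 500,000 feeding a Linear SVC, per the results above), predictions on new reviews could look like this; the review texts are made up for illustration, and the training split from the earlier sketch is reused:

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    best_model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=500000)),  # optimal setting found above
        ("clf", LinearSVC()),
    ])
    best_model.fit(X_train, y_train)

    # Made-up example reviews to demonstrate the model's predicted ratings.
    samples = [
        "absolutely fantastic game great mechanics endless replay value",
        "boring repetitive waste of money",
    ]
    print(best_model.predict(samples))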


Conclusion & Contributions

As we can discern, there is a noticeable difference: the unsampled (biased) raw data, when passed into the classifier, hampers our model's predictions with a fairly low accuracy of ~16%, whereas on the sampled training set we were able to predict with an accuracy of ~83.2%.

Challenges faced

  • Textual representation: words having the same meaning should have a similar representation. This approach to representing words and documents can be considered one of the main breakthroughs of deep learning for solving natural language problems.

Contributions

  • In this experiment we successfully processed the BoardGamesGeek-Reviews dataset and trained our best-fit model to generate rating predictions based on users' reviews.

  • My approach compares baseline classifiers against pipelined models using a more customized feature selection technique: it purges all inadequate words in the data that do not contribute to prediction, helping the algorithm learn from relevant words.

  • Incorporated several feature selection techniques into the modeling to enhance the scope of the ML classifiers for better throughput. Several classifiers were tested rigorously on the same dataset and their test accuracies compared.

We can see that training the models on the same dataset with different attributes yielded varied levels of accuracy, with overall huge success for the Linear SVC model, which achieved an exceptional accuracy of 83.2% on the sampled training dataset. This model can now be used to predict ratings with a score that matches actual human ratings in the real world to a great extent.

