Executive Summary

This project was completed in R using the following methods: NLP/text mining, logistic regression, LASSO, PCA, and random forest.

This project aims to predict star ratings of unlocked mobile phones on Amazon.com using machine learning techniques. We use the Star Rating Prediction dataset from Kaggle, which contains features such as product name, product brand, product price, review rating, review text, and the number of votes each review received.

The exploratory data analysis revealed insights such as phone price range, review votes distribution, average ratings by brand, and the distribution of phone ratings. A two-category response variable was created to classify reviews as 5-star or non-5-star. A document term matrix (dtm) was generated for review text after preprocessing steps.

A LASSO model was trained on all features, achieving an error rate of approximately 20.0%. Relevant words associated with 5-star reviews were identified using LASSO and logistic regression (24.6% error). Principal Component Analysis (PCA) was also applied before fitting LASSO, but it worsened rather than improved model performance.

A Random Forest classification model was implemented on both the original and PCA-transformed data, with the non-PCA model outperforming its PCA counterpart. The Random Forest without PCA achieved the lowest validation error of 5.9%.

In conclusion, the Random Forest model without PCA achieved the highest accuracy of 94.08% on the validation set. This model can effectively predict 5-star phone products on Amazon, assisting users in making informed decisions.

Goal of the study

Surfing Amazon products and reviews to find which products are of the highest quality can often be a stressful and tedious task. This project aims to analyze Amazon products and help users understand which products are most likely to suit their needs based on product reviews. The intention is to use the text of mobile phone reviews to predict whether a review gave a product 5 stars or fewer than 5 stars. This will be achieved through text processing and regression methods. Reviews on Amazon tend to skew high, so even a product with a 4.5-star average rating can turn out to be a poor product; a product that earns 5 stars, however, will usually satisfy a customer’s expectations. Therefore, we predict 5-star vs. sub-5-star reviews.

First, we will clean the dataset and perform EDA to better understand it. We will split the data into training, test, and validation sets and apply PCA to reduce dimensionality. We will then fit several prediction models and compare their accuracy to select the most useful model. By the end, we hope to provide a tool that helps Amazon customers save time and get more satisfaction out of their shopping on Amazon.

Dataset

Data was sourced from the Kaggle Star Rating Prediction dataset to obtain reviews for unlocked mobile phones sold on Amazon.com. Each review contains the following (possibly null) features: product name, brand name, phone price, review rating, the review text itself, and the number of votes the review received. Review ratings range from 1 to 5. The dataset contains over 414K reviews.

Data Processing

First, we load in the data from the Kaggle dataset.

amazon <- read.csv("Amazon_Unlocked_Mobile.csv")
summary(amazon)

Next, we process the data. We start with roughly 414k rows and 6 columns: product name, brand name, price, rating, review text, and number of review votes. Rating is scored from 1 to 5, the review is the text written by a single user, and the number of review votes is the number of votes that particular review received. Hence, each row represents a unique phone review.

To process the data, we first drop the rows with N/A values. We then count the number of ratings per brand and order the brands from the most reviews to the fewest. Because the dataset includes many obscure brands with few reviews, we keep only the top 20 brands by review count. This leaves us with companies such as Samsung, BLU, Apple, LG, BlackBerry, and Nokia. At this point we still have over 300k rows.

Given our data set is so large, we then take a random sample of 50k rows to move forward with for the rest of our project. This makes the run time of our future models reasonable.

Our data is now cleaned.

# number of rows and columns
nrow(amazon) # 413,840 rows
ncol(amazon) # 6 cols

# drop NA rows
amazon <- na.omit(amazon)

# calculate the number of ratings per brand and order by descending number of ratings
amazon_count <- amazon %>% 
  group_by(Brand.Name) %>% 
  summarize(num_ratings = n()) %>% 
  arrange(desc(num_ratings)) %>% 
  filter(Brand.Name != "")

# select only the top 20 brands
top_brands <- amazon_count$Brand.Name[1:20]
amazon_top <- amazon %>% 
  filter(Brand.Name %in% top_brands)

# number of rows and columns
nrow(amazon_top) # 307,826 rows

# take a random sample of 50k rows (our existing 400k rows make our models too slow)
set.seed(1)
amazon_sample <- amazon_top %>% 
  sample_n(50000, replace = FALSE)
amazon <- amazon_sample

# confirm the number of rows 
nrow(amazon) # 50,000 rows

EDA

Next, we performed EDA on our data.

We see the cheapest phone costs 1.73 USD and the most expensive costs 2,408.73 USD. The median phone price is 139.95 USD and the mean is 229.69 USD.

We see most reviews don’t receive any votes (median of 0) but some reviews have many votes (max of 478).

We also calculated the average rating by brand. Across brands, the mean of the brand-level average ratings is 3.8. The worst-rated brand (Polaroid) averages 2.9 across its phones, and the best-rated brand (OtterBox) averages 4.5.

We also calculate the number of ratings by brand. Brands have a median of 1,270 ratings and a mean of roughly 2,500 ratings. The least-rated brand (verykool) has 183 ratings and the most-rated brand (Samsung) has 10,372.

We also look at the average phone price by brand. Among these brand-level averages, the minimum is 90.39 USD, the median is 247.30 USD, the mean is 234.34 USD, and the maximum is 378.94 USD.

Looking at the distribution of phone ratings, we see that 17% of reviews receive 1 star, 6% receive 2 stars, 8% receive 3 stars, 15% receive 4 stars, and 55% receive 5 stars.

# price 
max(amazon$Price) # $2,408.73
min(amazon$Price) # $1.73
median(amazon$Price) # $139.95
mean(amazon$Price) # $229.69

# number of review votes
max(amazon$Review.Votes) # 478
min(amazon$Review.Votes) # 0 
median(amazon$Review.Votes) # 0 

# average rating by brand
amazon_avg <- amazon %>% 
  group_by(Brand.Name) %>% 
  summarize(avg_rating = mean(Rating), num_ratings = n()) %>% 
  arrange(num_ratings) %>% 
  mutate(Brand.Name = reorder(Brand.Name, num_ratings))

summary(amazon_avg)

ggplot(amazon_avg, aes(x=Brand.Name, y=avg_rating)) + 
  geom_bar(stat="identity", fill="blue") +
  labs(title="Average Rating per Brand", x="Brand", y="Average Rating") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# number of ratings by brand
ggplot(amazon_avg, aes(x = Brand.Name, y = num_ratings)) +
  geom_bar(stat = "identity") +
  xlab("Brand") +
  ylab("Number of ratings") +
  ggtitle("Number of ratings per brand") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# calculate the average price by brand
avg_price_by_brand <- amazon %>%
  group_by(Brand.Name) %>%
  summarize(avg_price = mean(Price)) %>%
  top_n(15, avg_price) # keep only the top 15 brands with the highest avg prices

summary(avg_price_by_brand)

# create a bar graph for average price by brand
ggplot(avg_price_by_brand, aes(x = Brand.Name, y = avg_price)) + 
  geom_bar(stat = "identity", fill = "blue") +
  xlab("Brand") +
  ylab("Average Price") +
  ggtitle("Brands vs Average Phone Prices") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# distribution of ratings
ggplot(amazon, aes(x=Rating)) + 
  geom_histogram(binwidth=1, fill="blue", color="white") +
  geom_text(stat='count', aes(label=..count..), vjust=-0.5) +
  labs(title="Histogram of Ratings", x="Ratings", y="Frequency")

rating_dist <- amazon %>% count(amazon$Rating)
rating_dist$Percentage = (rating_dist$n / sum(rating_dist$n))*100
colnames(rating_dist)[colnames(rating_dist) == "amazon$Rating"] ="rating"
rating_dist$rating = factor(rating_dist$rating)

ggplot(rating_dist, aes(x = "", y = n, fill = rating)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(Percentage), "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "Pie Chart of Ratings")+
  scale_fill_manual(values = c("#F8766D", "#C49A00", "#53B400", "#00C094", "#00BFC4"))

Feature Engineering

Creating a two-category response variable

To simplify our analysis, we create a two-category response variable to classify each review into either a 5-star rating or a non-5-star rating. We chose to go with 5-star vs non-5-star review over positive vs negative review because 5-star ratings make up over 50% of the reviews.

We create a new column called “score” for this purpose.

amazon$score <- c(0)
amazon$score[amazon$Rating == 5] <- 1

# distribution of scores
ggplot(amazon, aes(x=score)) + 
  geom_histogram(binwidth=1, fill="blue", color="white") +
  geom_text(stat='count', aes(label=..count..), vjust=-0.5) +
  labs(title="Histogram of Scores", x="Scores", y="Frequency")

score_dist <- amazon %>% count(amazon$score)
score_dist$Percentage = (score_dist$n / sum(score_dist$n))*100
colnames(score_dist)[colnames(score_dist) == "amazon$score"] ="score"
score_dist$score = factor(score_dist$score)

ggplot(score_dist, aes(x = "", y = n, fill = score)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  geom_text(aes(label = paste0(round(Percentage), "%")), position = position_stack(vjust = 0.5)) +
  labs(title = "Pie Chart of Ratings")+
  scale_fill_manual(values = c("#F8766D", "#00C094"))

Document term matrix (aka bag of words)

Next, we build a document term matrix (dtm) for the review text. This is a word-frequency matrix: for each review (row), it records how often each word in the bag of words appears. We keep only words that appear in at least 0.5% of the 50k documents.

To clean the reviews, we convert the text to lowercase, remove punctuation, remove numbers, and remove common English stopwords. We also apply stemming so that different variations of a word are counted as the same term. Stemming truncates words to their radical form, so that score, scoring, scored, or scores all become ‘scor’ (i.e., linguistic normalization).
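
As a quick illustration of stemming (this snippet is our addition; it assumes the SnowballC package, which tm uses internally when stemming = TRUE is set below), related word forms collapse to a shared stem:

library(SnowballC)

# Illustration only: Porter stemming maps inflected forms to a common stem,
# producing truncated terms like the "amaz" and "awesom" seen later in the word cloud
wordStem(c("amazing", "amazed", "awesome", "works", "working"))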

Finally, we combine our score variable with the dtm into a new data frame called amazon_df.

# turn Reviews to corpus
corpus <- VCorpus(VectorSource(amazon$Reviews))

# control list for creating our DTM within DocumentTermMatrix
# we remove punctuation, numbers and stop words
control_list <- list(tolower = TRUE,
                     removePunctuation = TRUE,
                     removeNumbers = TRUE,
                     stopwords = stopwords("english"),
                     stemming = TRUE)

# dtm with all terms:
dtm.long <- DocumentTermMatrix(corpus, control = control_list)

# drop rare terms (keep only terms appearing in at least 0.5% of the documents)
dtm <- removeSparseTerms(dtm.long, 1 - .005)
inspect(dtm)

# create a new data frame
amazon_df <- data.frame(score = amazon$score, as.matrix(dtm))
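
As an optional check that is not part of the original pipeline, the tm function findFreqTerms can list the retained stemmed terms whose total count exceeds a threshold (the value of 5,000 below is arbitrary):

# Optional: list retained terms appearing more than 5,000 times across the sample
findFreqTerms(dtm, lowfreq = 5000)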

Predictive Modeling

Before we conduct our analysis, we split our data into training, testing, and validation data. We set aside 32,500 reviews for training, 12,500 reviews for testing, and 5,000 reviews for validation.

set.seed(1)

# split data into training, testing, and validation data sets
index_train <- sample(1:nrow(amazon_df), 32500)
index_test <- sample(setdiff(1:nrow(amazon_df), index_train), 12500)
index_val <- setdiff(1:nrow(amazon_df), c(index_train, index_test))

#Factor the score column to make it categorical
amazon_df$score = factor(amazon_df$score)

# create training, testing, and validation data sets
amazon_df.train <- amazon_df[index_train, ]
amazon_df.test <- amazon_df[index_test, ]
amazon_df.val <- amazon_df[index_val, ]

LASSO (without PCA)

We would like to start with a LASSO (Least Absolute Shrinkage and Selection Operator) model without using PCA. Given that our data has many variables, LASSO helps us select the significant coefficients. In this model we will train on all features to predict the “score” variable of the data frame. In addition, we will use cross-validation over a grid of lambda values to find the optimal lambda.

#Run an initial glm model and find the optimal lambda value
y <- amazon_df.train$score
X1 <- sparse.model.matrix(score~., data=amazon_df.train)[, -1]
fit.lasso <- cv.glmnet(X1, y, alpha=.99, family="binomial")
lambda <- fit.lasso$lambda.min
plot(fit.lasso)

Using the optimal lambda value, we now create our final LASSO model and predict on the test dataset (the prediction below uses the cross-validated fit at lambda.1se). We find that our test error is around 20%.

# Create the final fit based on the optimal lambda value
final_fit.lasso <- glmnet(X1,y,alpha = 0.99,lambda=lambda,family="binomial")
predict.lasso <- predict(fit.lasso, as.matrix(amazon_df.test[,-1]), type = "class", s="lambda.1se")
# output the predicted class labels
# LASSO testing errors
mean(amazon_df.test$score != predict.lasso)
## [1] 0.2012
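
For additional context (this table is not part of the original report), a confusion table shows how the roughly 20% test error splits across the two classes:

# Our addition: cross-tabulate true scores against the LASSO class predictions
table(truth = amazon_df.test$score, predicted = as.vector(predict.lasso))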

Word Cloud

From our LASSO model, we select the non-zero coefficients picked up using lambda.1se. This yields 328 words deemed useful for predicting 5-star reviews.

We then feed these 328 words from above to fit a logistic regression, which we call fit.glm.

We then pull out all of the positive coefficients and the corresponding words, sorting the coefficients in decreasing order.

We then keep the positive coefficients, which are positively related to the probability of being a 5-star review. We see that excelent and excel have the largest coefficients (2.37 and 1.58 respectively), meaning these words are the most strongly associated with a 5-star review. In total, we arrive at 145 good words that we associate with 5-star reviews.

We create a word cloud with the 100 leading good words, as shown below.

# non-zero words picked up by LASSO when using lambda.1se
coef.1se <- coef(fit.lasso, s="lambda.1se")
lasso.words <- coef.1se@Dimnames[[1]] [coef.1se@i][-1] # non-zero variables without intercept.
summary(lasso.words)
##    Length     Class      Mode 
##       328 character character
# feed the output from LASSO above, get a logistic regression
sel_cols <- c("score", lasso.words)
# use all_of() to specify we would like to select variables in sel_cols
data_sub <- amazon_df.train %>% select(all_of(sel_cols))
fit.glm <- glm(score~., family=binomial, data_sub) 

# pull out all the positive coefficients and the corresponding words. Rank the coefficients in a decreasing order. 
fit.glm.coef <- coef(fit.glm)
hist(fit.glm.coef)

# pick up the positive coefficients
good.glm <- fit.glm.coef[which(fit.glm.coef > 0)]
good.glm <- good.glm[-1] # took intercept out
names(good.glm)[1:20] # which words are positively associated with good ratings
##  [1] "ago"       "amaz"      "android"   "awesom"    "band"      "beat"     
##  [7] "best"      "box"       "bright"    "built"     "cabl"      "came"     
## [13] "can"       "carrier"   "choic"     "close"     "come"      "complain" 
## [19] "complaint" "condit"
good.fre <- sort(good.glm, decreasing = TRUE) # sort the coef's
length(good.fre) # 145 good words
## [1] 145
hist(as.matrix(good.fre), breaks=30, col="red")

good.word <- names(good.fre) # good words with a decreasing order in the coef's

# word cloud
cor.special <- brewer.pal(8,"Dark2") # set color scheme
wordcloud(good.word[1:100], good.fre[1:100], colors=cor.special, ordered.colors=F)

Logistic Regression

We now assess the logistic regression model, fit.glm, used to create our word cloud. Running the model on the test set, we find the testing error to be around 24%. We also plot a ROC (Receiver Operating Characteristic) curve and report its AUC (Area Under the Curve) to visualize the model’s performance.

The ROC curve plots the true positive rate against the false positive rate, and the AUC measures the degree of separability: it tells how well the model distinguishes between classes, with higher values being better. We find the logistic regression model has an AUC of 0.74, meaning there is roughly a 74% chance that the model ranks a randomly chosen 5-star review above a randomly chosen sub-5-star review.

#summary(fit.glm)
probabilities.glm <- fit.glm %>% predict(amazon_df.test[,-1], type = "response")
predicted.classes.glm <- ifelse(probabilities.glm > 0.5, 1, 0)

# evaluate model
# testing error
mean(predicted.classes.glm != amazon_df.test$score)
## [1] 0.24256
# confusion matrix: table(amazon_df.test$score, predicted.classes.glm)

# ROC-AUC curve
ROCPred <- prediction(predicted.classes.glm, amazon_df.test$score) 
ROCPer <- performance(ROCPred, measure = "tpr", x.measure = "fpr")
auc <- performance(ROCPred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7421763
# plotting curve
plot(ROCPer, colorize = TRUE, 
     print.cutoffs.at = seq(0.1, by = 0.1), 
     main = "ROC CURVE")
abline(a = 0, b = 1)
   
auc <- round(auc, 4)
legend(.6, .4, auc, title = "AUC", cex = 1)
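
As a side note (this variant is our addition, not part of the original analysis), the ROC curve is usually built from predicted probabilities rather than the thresholded 0/1 labels; the sketch below would typically produce a smoother curve and a somewhat different AUC than the value reported above.

# Alternative ROC built from predicted probabilities instead of hard class labels
ROCPred.prob <- prediction(probabilities.glm, amazon_df.test$score)
ROCPer.prob <- performance(ROCPred.prob, measure = "tpr", x.measure = "fpr")
performance(ROCPred.prob, measure = "auc")@y.values[[1]]
plot(ROCPer.prob, colorize = TRUE, main = "ROC Curve (probability-based)")
abline(a = 0, b = 1)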

PCA

Now we would like to generate PCA versions of the training and test datasets as an alternative approach. We will fit each model with and without PCA to see which performs best. PCA helps with 1) dimension reduction: capturing the main features and reducing the noise hidden in the data, 2) grouping variables/subjects efficiently: revealing insightful grouping structures, and 3) visualization: displaying high-dimensional data.

As we will see later, though, the models without PCA perform better than the models with PCA.

#Create the PCAs for the amazon df
pca <- prcomp(amazon_df.train[,-1], scale. = TRUE)
pve <- summary(pca)$importance[2, 1:30]
plot(pve, type="b", pch = 19, frame = FALSE)

We see that 5 is an acceptable number of principal components, as it is close to the “elbow” of the proportion-of-variance plot above.
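
As an optional check (our addition), the cumulative proportion of variance explained, which is the third row of the importance table returned by summary(pca), shows how much of the total variance the first components capture:

# Cumulative proportion of variance explained by the first 30 PCs
cpve <- summary(pca)$importance[3, 1:30]
plot(cpve, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of principal components", ylab = "Cumulative PVE")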

#Create PCA versions of the training, test, and validation datasets
num_pcas <- 5
pcs = pca$x[,1:num_pcas]
pca_train_df = data.frame(score = amazon_df.train$score, pcs)
pcs_test <- predict(pca,newdata = as.matrix(amazon_df.test[,-1]))[,c(1:num_pcas)]
pca_test_df <- data.frame(pcs_test)
pcs_valid <- predict(pca,newdata = as.matrix(amazon_df.val[,-1]))[,c(1:num_pcas)]
pca_valid_df <- data.frame(pcs_valid)

LASSO Model with PCA

We would now like to test the LASSO model using the PCA version of the inputs, to see whether the model benefits from dimensionality reduction. However, after running the model, we see that the testing error is around 32.8%, which is worse than the error of the model without PCA (20%).

#Build the x and y components from the PCA training data and find the best lambda value via cross-validation
y <- amazon_df.train$score
X1 <- sparse.model.matrix(score~., data=pca_train_df)[, -1]
pca_fit.lasso <- cv.glmnet(X1, y, alpha=.99, family="binomial")
lambda <- pca_fit.lasso$lambda.min
plot(pca_fit.lasso)

#Create a new model using the best lambda value
pca_final_fit.lasso <- glmnet(X1,y,alpha = 0.99,lambda=lambda,family="binomial")
pca_predict.lasso <- predict(pca_final_fit.lasso, as.matrix(pca_test_df), type = "class", s="lambda.1se")
# output the predicted class labels
# LASSO testing errors
mean(amazon_df.test$score != pca_predict.lasso)
## [1] 0.32816

Random Forest Classification

We would now like to predict on the dataset with a random forest model. It is a non-linear model, which may perform better on this dataset given its many dimensions.

We will create models both with and without PCA, using 200 trees, and predict score from all other features in the dataset.

In addition, the Gini index for classification is used as the measure of variable importance. The Gini index measures the probability of a particular observation being wrongly classified when it is labeled at random according to the class distribution in its node. It ranges between 0 and 1, with 0 meaning all elements in a node belong to a single class (pure) and larger values meaning elements are spread across classes (impure); for a two-class problem such as ours, the maximum is 0.5. The random forest performs classification rather than regression because we converted the score variable to a factor during the data processing step.
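
As a minimal illustration of this splitting criterion (our sketch, not part of the fitted models), the Gini impurity of a node with class proportions p is 1 - sum(p^2): a pure node scores 0, and for two classes a 50/50 split gives the worst case of 0.5.

# Minimal sketch: Gini impurity of a node given its class proportions
gini <- function(p) 1 - sum(p^2)
gini(c(1, 0))        # 0: a pure node (all 5-star or all non-5-star)
gini(c(0.55, 0.45))  # 0.495: close to the two-class maximum of 0.5
gini(c(0.9, 0.1))    # 0.18: a mostly pure node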

Random Forest (without PCA)

#Create a Random Forest model on the original (non-PCA) features (this call fits on the full amazon_df)
fit.rf <- ranger::ranger(score~., amazon_df, num.trees = 200, importance="impurity") # no plot
## Growing trees.. Progress: 86%. Estimated remaining time: 5 seconds.
fit.rf
## Ranger result
## 
## Call:
##  ranger::ranger(score ~ ., amazon_df, num.trees = 200, importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  200 
## Sample size:                      50000 
## Number of independent variables:  636 
## Mtry:                             25 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             16.08 %
#Find the testing error of the Random Forest
predict.rf <- predict(fit.rf, data=amazon_df.test, type="response") # output the classes by majority
mean(amazon_df.test$score != predict.rf$predictions)
## [1] 0.06352

Random Forest with PCA

#Create a Random Forest model on the PCA-transformed training data
pca_fit.rf <- ranger::ranger(score~., pca_train_df, num.trees = 200, importance="impurity") # no plot
pca_fit.rf
## Ranger result
## 
## Call:
##  ranger::ranger(score ~ ., pca_train_df, num.trees = 200, importance = "impurity") 
## 
## Type:                             Classification 
## Number of trees:                  200 
## Sample size:                      32500 
## Number of independent variables:  5 
## Mtry:                             2 
## Target node size:                 1 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error:             23.20 %
#Find the testing error of the Random Forest
pca_predict.rf <- predict(pca_fit.rf, data=pca_test_df, type="response") # output the classes by majority
mean(amazon_df.test$score != pca_predict.rf$predictions)
## [1] 0.23656

Once again, we see that the model without PCA (6% error) outperformed the model with PCA (24% error). Additionally, the random forest is considerably more accurate than the LASSO models, most likely because it can capture non-linear relationships among the many word features.
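
As an optional follow-up (our addition, not part of the original report), the impurity-based importances stored by ranger when importance = "impurity" is set can show which stemmed words contribute most to the random forest's splits:

# Top 20 features by Gini-impurity importance in the non-PCA random forest
head(sort(fit.rf$variable.importance, decreasing = TRUE), 20)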

Model Results

Out of the five models, the Random Forest model without PCA performed the best on the testing data set. To help confirm this conclusion, all five models are run on the validation dataset and their respective errors compared.

#Find the validation error of all five models
#Logistic regression
val_predict.glm <- fit.glm %>% predict(amazon_df.val[,-1], type = "response")
val_predict.glm <- ifelse(val_predict.glm > 0.5, 1, 0)
#LASSO without PCA
val_predict.lasso <- predict(final_fit.lasso, as.matrix(amazon_df.val[,-1]), type = "class", s="lambda.1se")
#LASSO with PCA
val_pca_predict.lasso <- predict(pca_final_fit.lasso, as.matrix(pca_valid_df), type = "class", s="lambda.1se")
#Random Forest without PCA
val_predict.rf <- predict(fit.rf, data=amazon_df.val, type="response") 
#Random Forest with PCA
val_pca_predict.rf <- predict(pca_fit.rf, data=pca_valid_df, type="response")

#Logistic Regression error
print(paste('Logistic Regression error =', mean(amazon_df.val$score != val_predict.glm)*100))
## [1] "Logistic Regression error = 24.64"
# LASSO without PCA error
print(paste('LASSO without PCA error =', mean(amazon_df.val$score != val_predict.lasso)*100))
## [1] "LASSO without PCA error = 20.06"
#LASSO with PCA error
print(paste('LASSO with PCA error =', mean(amazon_df.val$score != val_pca_predict.lasso)*100))
## [1] "LASSO with PCA error = 31.84"
#Random Forest without PCA error
print(paste('Random Forest without PCA error =', mean(amazon_df.val$score != val_predict.rf$predictions)*100))
## [1] "Random Forest without PCA error = 5.92"
#Random Forest with PCA error
print(paste('Random Forest with PCA error =', mean(amazon_df.val$score != val_pca_predict.rf$predictions)*100))
## [1] "Random Forest with PCA error = 22.7"

The validation errors, listed from greatest to smallest:

  1. LASSO with PCA

  2. Logistic Regression

  3. Random Forest with PCA

  4. LASSO without PCA

  5. Random Forest without PCA

Within these results, we see once again that the non-PCA models outperformed the PCA models and that the best model is the Random Forest model without PCA, which had a validation error of 5.92%.

Conclusion

The best model achieved is the Random Forest model with the original (non-PCA) inputs, which reached 94.08% accuracy on the validation set. We now display a confusion matrix of the final model’s predictions on the validation set and see that both types of misclassification are low: treating the non-5-star class (score = 0) as the positive class, as in the code below, the false negative rate is 11.6% and the false positive rate is 1.4% (equivalently, only 1.4% of true 5-star reviews are missed, while 11.6% of non-5-star reviews are predicted as 5-star).

Overall, this model will be useful to predict 5-star phone products on Amazon and help Amazon users become more informed consumers.

table(val_predict.rf$predictions,amazon_df.val$score)
##    
##        0    1
##   0 1955   40
##   1  256 2749
# here the non-5-star class (0) is treated as the "positive" class
TP <- table(val_predict.rf$predictions,amazon_df.val$score)[1,1] # predicted 0, actual 0
FP <- table(val_predict.rf$predictions,amazon_df.val$score)[1,2] # predicted 0, actual 1
FN <- table(val_predict.rf$predictions,amazon_df.val$score)[2,1] # predicted 1, actual 0
TN <- table(val_predict.rf$predictions,amazon_df.val$score)[2,2] # predicted 1, actual 1

FNR <- FN / (FN + TP) # 11.6%
FPR <- FP / (FP + TN) # 1.4%
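
As a final sanity check (our addition), the overall validation accuracy recovered from the confusion matrix matches the 94.08% figure quoted above:

# Overall validation accuracy: (1955 + 2749) / 5000 = 0.9408
(TP + TN) / (TP + FP + FN + TN)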