Executive Summary

This project was completed in R using the following methods: NLP/text mining, logistic regression, LASSO, PCA, and random forest.

This project aims to predict star ratings of unlocked mobile phones on Amazon.com using machine learning techniques. We use the Star Rating Prediction dataset from Kaggle, which contains features such as product name, product brand, product price, review rating, review text, and the number of votes each review received.

The exploratory data analysis revealed insights such as phone price range, review votes distribution, average ratings by brand, and the distribution of phone ratings. A two-category response variable was created to classify reviews as 5-star or non-5-star. A document term matrix (dtm) was generated for review text after preprocessing steps.
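As a minimal illustration of the document-term-matrix step, here is a conceptual base-R sketch on three toy reviews (the analysis itself would typically use a text-mining package such as tm; the reviews and helper names here are illustrative only):

```r
# toy reviews standing in for the real review text
reviews <- c("great phone love it", "terrible phone broke", "love this great deal")

tokens <- strsplit(tolower(reviews), "\\s+")   # lowercase and tokenize each review
vocab  <- sort(unique(unlist(tokens)))          # shared vocabulary across documents

# count each vocabulary word in each document: one row per review, one column per word
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(reviews)), vocab)
dtm
```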

A LASSO model was trained on all features, achieving an error rate of approximately 20.0%. Words associated with 5-star reviews were identified using LASSO and logistic regression (24.6% error). Principal Component Analysis (PCA) was applied before refitting the LASSO model but did not significantly improve performance.

A random forest classification model was fit on both the original and the PCA-transformed data, with the non-PCA model outperforming. The random forest without PCA achieved the lowest validation error, 5.9%.

In conclusion, the random forest model without PCA achieved the highest accuracy, 94.08%, on the validation set. This model can effectively identify 5-star phone products on Amazon, helping users make informed purchasing decisions.

Goal of the study

Browsing Amazon products and reviews to find the highest-quality items can be a stressful and tedious task. This project aims to analyze Amazon products and help users understand which products are most likely to suit their needs based on product reviews. The intention is to use the text of mobile phone reviews to predict whether a product was given a 5-star or a sub-5-star review, via text processing and regression methods. Reviews on Amazon tend to skew high, so even a product with a 4.5-star average rating can turn out to be poor; a product rated 5 stars, however, will usually satisfy a customer’s expectations. We therefore predict 5-star vs. sub-5-star reviews.

First, we clean the dataset and perform EDA to better understand it. We split the data into training, test, and validation sets and apply PCA to reduce dimensionality. We then fit several prediction models and compare their accuracy to select the most useful one. By the end, we hope to provide a tool that Amazon customers can use to save time and get more satisfaction out of their shopping.


Data

Data was sourced from the Kaggle Star Rating Prediction dataset to obtain reviews for unlocked mobile phones sold on Amazon.com. Each review contains the following features (possibly null): product name, brand name, phone price, review rating, the review text itself, and the number of votes the review received. Review ratings range from 1 to 5. The dataset contains over 414K reviews.

Data Processing

First, we load in the data from the Kaggle dataset.

# load required packages (used for data manipulation and plotting below)
library(dplyr)
library(ggplot2)

amazon <- read.csv("Amazon_Unlocked_Mobile.csv")

Next, we process the data. We start with 414k rows and 6 columns: product name, brand name, price, rating, review text, and number of review votes. Rating is scored 1 to 5, the review is a text review from a single user, and the number of review votes is the number of votes that particular review received. Hence, each row represents a unique phone review.

To process the data, we first drop the rows with N/A values. We then calculate the number of ratings per unique brand and order the brands from most to fewest reviews. We keep only the top 20 brands, since the full dataset contains many obscure companies with few reviews; this leaves us with well-reviewed brands such as Samsung, BLU, Apple, LG, BlackBerry, and Nokia. At this point we still have over 300k rows.

Given our data set is so large, we then take a random sample of 50k rows to move forward with for the rest of our project. This makes the run time of our future models reasonable.

Our data is now cleaned.

# number of rows and columns
nrow(amazon) # 413,840 rows
ncol(amazon) # 6 cols

# drop NA rows
amazon <- na.omit(amazon)

# calculate the number of ratings per brand and order by descending number of ratings
amazon_count <- amazon %>% 
  group_by(Brand.Name) %>% 
  summarize(num_ratings = n()) %>% 
  arrange(desc(num_ratings)) %>% 
  filter(Brand.Name != "")

# select only the top 20 brands
top_brands <- amazon_count$Brand.Name[1:20]
amazon_top <- amazon %>% 
  filter(Brand.Name %in% top_brands)

# number of rows and columns
nrow(amazon_top) # 307,826 rows

# take a random sample of 50k rows (the remaining 300k+ rows make our models too slow)
set.seed(123) # fix the random seed so the sample is reproducible
amazon_sample <- amazon_top %>% 
  sample_n(50000, replace = FALSE)
amazon <- amazon_sample

# confirm the number of rows 
nrow(amazon) # 50,000 rows


Next, we performed EDA on our data.

We see the cheapest phone costs 1.73 USD and the most expensive costs 2,408.73 USD. The median phone price is 139.95 USD and the mean phone price is 229.69 USD.

We see most reviews don’t receive any votes (median of 0), but some reviews receive many (max of 478).

We also calculated the average rating by brand. Across brands, the average rating is 3.8. The worst-rated brand (Polaroid) averages 2.9 across its phones, and the best-rated brand (OtterBox) averages 4.5.

We also calculate the number of ratings by brand. The median number of ratings per brand is 1,270 and the mean is 2,500. The least-rated brand (verykool) has 183 ratings and the most-rated brand (Samsung) has 10,372 ratings.

We also look at the average phone price by brand. Across brands, the minimum average price is 90.39 USD, the median is 247.30 USD, the mean is 234.34 USD, and the maximum is 378.94 USD.

Looking at the distribution of the phone ratings, we see 17% of the reviews receive 1 star, 6% of the reviews receive 2 stars, 8% of the ratings receive 3 stars, 15% of ratings receive 4 stars and 55% of ratings receive 5 stars.
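These percentages can be obtained with prop.table; here is a self-contained sketch on a toy ratings vector (in the report this would be run on amazon$Rating):

```r
# toy stand-in for amazon$Rating
ratings <- c(1, 5, 5, 3, 4, 5, 1, 5, 2, 5)

# share of reviews at each star level, as percentages
round(100 * prop.table(table(ratings)), 1)
```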

# price 
max(amazon$Price) # $2,408.73
min(amazon$Price) # $1.73
median(amazon$Price) # $139.95
mean(amazon$Price) # $229.69

# number of review votes
max(amazon$Review.Votes) # 478
min(amazon$Review.Votes) # 0 
median(amazon$Review.Votes) # 0 

# average rating by brand
amazon_avg <- amazon %>% 
  group_by(Brand.Name) %>% 
  summarize(avg_rating = mean(Rating), num_ratings = n()) %>% 
  arrange(num_ratings) %>% 
  mutate(Brand.Name = reorder(Brand.Name, num_ratings))


ggplot(amazon_avg, aes(x=Brand.Name, y=avg_rating)) + 
  geom_bar(stat="identity", fill="blue") +
  labs(title="Average Rating per Brand", x="Brand", y="Average Rating") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# number of ratings by brand
ggplot(amazon_avg, aes(x = Brand.Name, y = num_ratings)) +
  geom_bar(stat = "identity") +
  xlab("Brand") +
  ylab("Number of ratings") +
  ggtitle("Number of ratings per brand") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
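The per-brand rating counts cited earlier (median, mean, and the extremes) can be summarized directly in base R. A toy sketch follows; only the verykool, median, and Samsung counts come from our data, and the other brands' counts are illustrative placeholders (in the report this would be run on amazon_avg$num_ratings):

```r
# toy stand-in for amazon_avg$num_ratings
num_ratings <- c(verykool = 183, BLU = 950, LG = 1270, Apple = 2100, Samsung = 10372)

summary(num_ratings)                        # min, median, mean, max
names(num_ratings)[which.max(num_ratings)]  # most-rated brand: "Samsung"
names(num_ratings)[which.min(num_ratings)]  # least-rated brand: "verykool"
```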

# calculate the average price by brand
avg_price_by_brand <- amazon %>%
  group_by(Brand.Name) %>%
  summarize(avg_price = mean(Price)) %>%
  top_n(15, avg_price) # keep only the top 15 brands with the highest avg prices


# create a bar graph for average price by brand
ggplot(avg_price_by_brand, aes(x = Brand.Name, y = avg_price)) + 
  geom_bar(stat = "identity", fill = "blue") +
  xlab("Brand") +
  ylab("Average Price") +
  ggtitle("Brands vs Average Phone Prices") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
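The summary statistics for average price per brand cited earlier can be computed with summary(). A toy sketch follows; the min, median, and max values come from our data, and the remaining values are illustrative placeholders (in the report this would be run on avg_price_by_brand$avg_price):

```r
# toy stand-in for avg_price_by_brand$avg_price
avg_price <- c(90.39, 150.00, 247.30, 310.00, 378.94)

summary(avg_price)  # min, median, mean, max of the per-brand averages
```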