Browsing the Amazon marketplace, you will find thousands of reviews across products in every category. Amazon runs a program called Vine in which reviewers are essentially "paid" (sometimes in the form of products) to write reviews. Our job today was to use PySpark to analyze the Vine review data and decide whether the Vine program is a success. In other words, are the paid Vine users writing reliable reviews? Does paying reviewers produce higher-quality reviews?
- Software/IDE: Google Colaboratory
- Data Sources: Amazon Reviews Dataset
- Libraries: PySpark | Pandas | Matplotlib
- Source Code: Vine_Review_Analysis.ipynb
As mentioned right above, the category of product reviews we are analyzing is Furniture (US). The goal for today was to answer the following questions:
- How many Vine reviews and non-Vine reviews were there?
  - Vine reviews: 136
  - Non-Vine reviews: 18,019
- How many Vine reviews were 5 stars?
  - Vine reviews: 74
- How many non-Vine reviews were 5 stars?
  - Non-Vine reviews: 8,482
- What percentage of Vine reviews were 5 stars?
  - Vine reviews 5-star percentage: 54.41%
- What percentage of non-Vine reviews were 5 stars?
  - Non-Vine reviews 5-star percentage: 47.07%
SPOILER RESULTS FROM PYSPARK DATAFRAME
I added some spoilers to the questions above but below we are going to break it down a little more. Let's get started!
To get a good idea of our dataset, we needed to separate the non-Vine users from the Vine users and count the reviews in each group. This gives us a very basic breakdown of the data distribution.
To ensure quality users, we filtered our original dataframe to users with 20+ reviews. From that dataframe, which we called pdHelpfulDf, we created a new dataframe called barHelpfulDf, which consists of each reviewer's vine status and customer_id, as seen above. NOTE: The dataframe above does not display any "Y" values in the "vine" column because there are so few in the dataframe.
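The notebook did this filtering and grouping with PySpark DataFrames. As a rough sketch of the same logic, here it is in plain Python on a handful of made-up rows. The toy rows, the `MIN_REVIEWS` name and the `vine_breakdown` helper are assumptions for illustration; only the `customer_id` and `vine` field names come from the write-up.

```python
from collections import Counter

# Hypothetical constant name; the write-up filters to users with 20+ reviews.
MIN_REVIEWS = 20

def vine_breakdown(rows, min_reviews=MIN_REVIEWS):
    """Keep reviews from customers with at least `min_reviews` reviews,
    then count the kept reviews by Vine status ("Y" or "N")."""
    reviews_per_customer = Counter(row["customer_id"] for row in rows)
    kept = [
        row for row in rows
        if reviews_per_customer[row["customer_id"]] >= min_reviews
    ]
    return Counter(row["vine"] for row in kept)

# Made-up toy data, NOT the real dataset:
# customer 1 has 25 non-Vine reviews, customer 2 has 20 Vine reviews,
# customer 3 has only 3 reviews and is filtered out.
toy_rows = (
    [{"customer_id": 1, "vine": "N"}] * 25
    + [{"customer_id": 2, "vine": "Y"}] * 20
    + [{"customer_id": 3, "vine": "N"}] * 3
)

counts = vine_breakdown(toy_rows)
print(counts["Y"], counts["N"])
```

Counting per customer first and filtering second keeps the two steps separate, which mirrors the filter-then-group order described above.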
As you can see, the non-Vine review count heavily outweighs the Vine review count. Ideally there would be more Vine data points: with the current count so low, even a small change could drastically alter the results.
- Vine Review Count: 136
  - There are 136 reviews by Vine (PAID) users. As mentioned above, this data set would ideally be larger.
- Non-Vine Review Count: 18,019
  - There are 18,019 reviews by non-Vine (UNPAID) users.
Next we will break down the star ratings and their percentages for both the Vine and non-Vine user reviews, and I've got the perfect visualization for it: a pie chart!
Below is a pie chart displaying the star rating breakdown for the reviews left by Vine users. Not only can we see the number of 5-star ratings and their percentage, we can also see the counts and percentage makeup of the lower star ratings. NOTE: 1-star ratings do not appear in the chart because there are none present.
- 5 Star
  - Reviews: 74
  - Percentage: 54.41%
- 4 Star
  - Reviews: 45
  - Percentage: 33.09%
- 3 Star
  - Reviews: 15
  - Percentage: 11.03%
- 2 Star
  - Reviews: 2
  - Percentage: 1.47%
- 1 Star (no 1 Star reviews present in Vine review data)
  - Reviews: 0
  - Percentage: 0%
Now let's look at the same breakdown for our non-Vine reviews and see if there is a difference. Remember, the goal here is to see if paying reviewers for their time is worth it.
- 5 Star
  - Reviews: 8,482
  - Percentage: 47.07%
- 4 Star
  - Reviews: 3,483
  - Percentage: 19.33%
- 3 Star
  - Reviews: 3,098
  - Percentage: 17.19%
- 2 Star
  - Reviews: 1,680
  - Percentage: 9.32%
- 1 Star
  - Reviews: 1,276
  - Percentage: 7.08%
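The percentages in both breakdowns follow directly from the review counts. As a minimal sketch in plain Python (the notebook itself used PySpark, and the `star_percentages` helper is a made-up name for illustration), each share is the star count divided by that group's total:

```python
def star_percentages(counts):
    """Return each star rating's share of all reviews, as a percentage
    rounded to two decimals."""
    total = sum(counts.values())
    return {star: round(100 * n / total, 2) for star, n in counts.items()}

# Review counts per star rating, copied from the two breakdowns above.
vine_counts = {5: 74, 4: 45, 3: 15, 2: 2, 1: 0}
non_vine_counts = {5: 8482, 4: 3483, 3: 3098, 2: 1680, 1: 1276}

vine_pct = star_percentages(vine_counts)
non_vine_pct = star_percentages(non_vine_counts)

for star in sorted(vine_counts, reverse=True):
    print(f"{star} star: Vine {vine_pct[star]:.2f}%  non-Vine {non_vine_pct[star]:.2f}%")
```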
As mentioned earlier, the review counts of the two data sets are drastically different. Keep that in mind. Even so, we can see that the share of 5-star reviews is 7.34 percentage points higher for Vine reviews (54.41% vs. 47.07%). On top of that, there are no 1-star Vine reviews at all. There may be some positivity bias happening here.
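One way to gauge whether a gap of this size could be down to the small Vine sample is a two-proportion z-test. This was not part of the original notebook; it is a minimal sketch that uses only the counts reported above, and the function name is made up for illustration.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# 5-star counts and review totals: Vine first, then non-Vine.
z, p_value = two_proportion_z_test(74, 136, 8482, 18019)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A p-value above the usual 0.05 cutoff would mean the gap is not clearly distinguishable from sampling noise at this sample size, which fits the caution about the low Vine review count.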
This analysis definitely needs further investigation and more data points before we can decide whether Amazon's Vine paid reviewer program is worth it.
- Increase Vine Review Data Set Size
  - In order to accurately compare the two data sets (Vine vs. non-Vine), we need many more data points for the Vine review data set. As of now, the slightest change can drastically affect the results.
- Use A Linear Regression Model To Evaluate The Validity of Each Review
  - Another (possible) big issue with the comparison is positivity bias. Are these users leaving better reviews because they are being paid? Or are they basing their reviews on the product itself? Linear regression could help us assess the validity of each Vine user's review by comparing it to the non-Vine review data set. Adding more data points for the Vine users would also help determine review accuracy. Would the 5-star share still be 7.34 percentage points higher if the Vine data set had 18,019 data points?
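How much a sample of only 136 reviews can swing by chance can be explored with a quick resampling experiment. The sketch below is an illustration under stated assumptions, not something from the notebook: it repeatedly draws 136 reviews at random from the non-Vine pool (rebuilt from the reported counts alone) and records the 5-star share of each draw. The trial count, seed and variable names are arbitrary choices.

```python
import random

NON_VINE_TOTAL = 18019
NON_VINE_FIVE_STAR = 8482
VINE_TOTAL = 136
VINE_FIVE_STAR_SHARE = 74 / 136

# True marks a 5-star review, False any other rating.
pool = [True] * NON_VINE_FIVE_STAR + [False] * (NON_VINE_TOTAL - NON_VINE_FIVE_STAR)

rng = random.Random(0)  # fixed seed so the run is repeatable
TRIALS = 2000           # arbitrary number of resamples

shares = []
for _ in range(TRIALS):
    # Draw a Vine-sized sample of non-Vine reviews, without replacement.
    sample = rng.sample(pool, VINE_TOTAL)
    shares.append(sum(sample) / VINE_TOTAL)

# Fraction of draws whose 5-star share matches or beats the Vine share.
at_least_vine = sum(share >= VINE_FIVE_STAR_SHARE for share in shares) / TRIALS
print(f"5-star share across draws: min {min(shares):.2%}, max {max(shares):.2%}")
print(f"draws at or above the Vine share: {at_least_vine:.1%}")
```

If a noticeable fraction of the draws reach the Vine share, that suggests a sample of this size can produce such a gap by chance alone, which is one more reason to collect additional Vine reviews before drawing conclusions.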






