Text Analysis of Book Reviews on Goodreads.com

The specific techniques I discuss are:

- basic text measures and word frequency analysis
- sentiment analysis
- topic modelling

These are filtered datasets containing only the books under these specific genres, which require less processing power, but note that the files are still a few gigabytes large.

Just like the full dataset, these come in 3 separate files for separate tables containing:

- the books
- the reviews
- the users

In this analysis I don’t focus as much on the users, so I only used the first two .JSON files downloaded from the above Google Drive links.

Before we can start our analysis, an essential step is getting, filtering and cleaning our text data.

The first challenge I encountered was the sheer size of the files.

Some techniques I used for faster processing time were:

Even then, some parts of the script took a long time to run, so if you plan to explore the full database, I recommend first saving your data in a database management tool like MongoDB and loading it from there.
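As a rough illustration of the loading step, the Goodreads files are newline-delimited JSON (one record per line), so you can read only the first n lines instead of parsing the whole file. This is a minimal sketch assuming the jsonlite package; the file names are placeholders for whatever the downloaded files are called on your machine.

```r
library(jsonlite)

# Read only the first n records of a newline-delimited JSON file
read_first_n <- function(path, n) {
  con <- file(path, open = "r")
  lines <- readLines(con, n = n)
  close(con)
  # stream_in parses the line-delimited JSON into a data frame
  stream_in(textConnection(lines), verbose = FALSE)
}

# Placeholder file names; the same applies to the books files
reviews_mystery <- read_first_n("goodreads_reviews_mystery.json", 100000)
reviews_history <- read_first_n("goodreads_reviews_history.json", 100000)
```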

Note that here I ended up working with the first 100 thousand reviews and 25 thousand book titles. Importantly, the books and reviews are not ordered by title, id or any other variable in the dataset, but seem to be randomly ordered. That means that selecting the first n rows is likely to be representative of the entire dataset. From the books table I used:

From the reviews table:

After loading the data for the two types of books, I joined them into a single table, but made sure to first label their categories as mystery and history, as I plan to compare the reviews left under these two categories throughout this project.

Additionally, I filtered for ratings of 1 and 5 stars for the reasons I mentioned in the intro. Note that reviews can have a rating of 0, which means a missing star rating: the user left a text review without selecting any stars.

Next, note that the same review can come up in both datasets (mystery and history), so I had to deal with these duplications. Depending on your question, you may keep the duplicated reviews if you want an accurate picture of how close the two genres’ reviews are. However, I decided to drop these duplicate reviews, both copies, as I am more interested in the differences. In any case, there were only about 1500 such rows out of around 65k, so they likely don’t make much difference.
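To make these preparation steps concrete, here is a minimal sketch assuming dplyr and placeholder data frame and column names (reviews_mystery, reviews_history, review_id, rating); the actual column names in the dataset may differ.

```r
library(dplyr)

# Label each genre's reviews, then stack them into a single table
reviews <- bind_rows(
  reviews_mystery %>% mutate(genre = "mystery"),
  reviews_history %>% mutate(genre = "history")
)

reviews_clean <- reviews %>%
  # keep only 1 and 5 star reviews (rating 0 means no stars were selected)
  filter(rating %in% c(1, 5)) %>%
  # drop every copy of a review that appears in both genre files
  group_by(review_id) %>%
  filter(n() == 1) %>%
  ungroup()
```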

Another type of comment that may not be useful is one containing just a URL. Some reviewers only leave a link to their blog/website where they have their full post about the book. These are not useful for our analysis, so I used a regular expression which removes URLs from the text but leaves other parts unchanged. If the URL was the whole review, this of course leaves an empty string, so afterwards I remove empty reviews.
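One possible version of this step (the exact pattern used in the original code is not reproduced here), using stringr and the placeholder column name review_text:

```r
library(dplyr)
library(stringr)

# Match http(s) and www links; everything else in the review is left untouched
url_pattern <- "(https?://\\S+)|(www\\.\\S+)"

reviews_clean <- reviews_clean %>%
  mutate(review_text = str_squish(str_remove_all(review_text, url_pattern))) %>%
  # reviews that consisted only of a link are now empty strings, so drop them
  filter(review_text != "")
```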

The last interesting topic is non-English reviews. One may either translate these or drop them; I chose the latter option. Either way, you first have to identify the foreign-language text.

Thankfully, R has several packages for this purpose, so I put them to the test. The language detection packages I found are: textcat, cld2 and cld3.
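As a rough illustration of such a test (the exact comparison run for the article is not reproduced here), each package can be applied to the same sample of reviews and the detected languages compared; the sample size and column name below are placeholders.

```r
library(dplyr)

# Compare the three detectors on a small random sample of reviews
set.seed(42)
sample_reviews <- reviews_clean %>% slice_sample(n = 1000)

detections <- sample_reviews %>%
  mutate(
    lang_textcat = textcat::textcat(review_text),      # n-gram profile based
    lang_cld2    = cld2::detect_language(review_text),  # Google's CLD2
    lang_cld3    = cld3::detect_language(review_text)   # neural CLD3
  )

# Share of the sample each detector labels as English
detections %>%
  summarise(
    textcat_en = mean(lang_textcat == "english", na.rm = TRUE),
    cld2_en    = mean(lang_cld2 == "en", na.rm = TRUE),
    cld3_en    = mean(lang_cld3 == "en", na.rm = TRUE)
  )
```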

Above you can see the results of my test using these packages on my data of about 60k reviews. The main takeaways:

Based on these results I used cld2 due to its good accuracy.
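Filtering to English reviews with cld2 is then essentially a one-liner (again with a placeholder column name):

```r
# Keep only the reviews that cld2 identifies as English
reviews_en <- reviews_clean %>%
  filter(cld2::detect_language(review_text) == "en")
```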

Now that my data is cleaned, I am still left with nearly 60k reviews, so it’s time to explore them. First, I created basic measures of the text and grouped them by genre (the original dataset) and rating, as you can see below.
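A minimal sketch of such measures, assuming dplyr and stringr and the placeholder column names from earlier; the exact complexity measures used in the article are not reproduced here, so average word length stands in as a simple example.

```r
library(dplyr)
library(stringr)

text_measures <- reviews_en %>%
  mutate(
    n_chars = nchar(review_text),
    n_words = str_count(review_text, "\\S+"),
    avg_word_length = n_chars / pmax(n_words, 1)  # crude complexity proxy
  ) %>%
  group_by(genre, rating) %>%
  summarise(
    avg_chars = mean(n_chars),
    avg_words = mean(n_words),
    avg_word_length = mean(avg_word_length),
    .groups = "drop"
  )
```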

Looking at avg. number of characters and words, we see a few important differences:

Looking at basic measures of complexity we see that:

In summary, so far it seems that history reviews and 5 star reviews may be more elaborate. However, also note that in all of these categories there are some reviews with very large values, which raise the averages.

Looking at the most common words in each category is always informative. To do that, there are some generic steps you usually want to take, and as you will see, text-specific adjustments can often be helpful as well.

In the below code I:

- tokenize the reviews into individual words,
- remove standard English stop words,
- lemmatize the remaining words.

After these steps, I can just group the word occurrences by our categories.
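The original code is not reproduced here; a minimal sketch of these steps with tidytext and textstem (including the small dictionary of generic reading-related words mentioned below) might look like this:

```r
library(dplyr)
library(tidytext)
library(textstem)

# Generic reading-related words that dominate every category
generic_words <- tibble(word = c("book", "read", "story"))

review_words <- reviews_en %>%
  # one row per word; tokenization also lower-cases and strips punctuation
  unnest_tokens(word, review_text) %>%
  # drop standard English stop words
  anti_join(stop_words, by = "word") %>%
  # reduce words to their lemmas, e.g. "boring"/"bored" -> "bore"
  mutate(word = lemmatize_words(word)) %>%
  # drop the generic reading-related terms
  anti_join(generic_words, by = "word")

# Count word occurrences within each genre/rating category
top_words <- review_words %>%
  count(genre, rating, word, sort = TRUE)
```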

Looking at the results, we see a few words at the top keep coming up: book, read and story. These are not surprising given the website we are looking at, but they also don’t give much information. So in the previous code I also made a small dictionary of these generic reading-related words and removed them, to look at the top words without them.

Without that noise the words start to make a bit more sense. The very categories we are looking at, history and mystery, pop up in the correct groups.

We also see some emotions related to the ratings: bore, the lemma of words like boring and bored, comes up in 1 star ratings.

It is interesting to see how universal the words love and time are, topping all of the categories, but we do have some topic words, such as war in history. Interestingly, good reviews there also mention learn, which may be important for readers of historical books, whereas mystery reviews have twist and enjoy, indicating what those readers are looking for.

Continuing on this idea, if we want to learn why some books are preferred over others, sentiment analysis can be the perfect tool. R has 3 popular general-purpose sentiment lexicons I am aware of: NRC, bing and afinn. They categorize sentiment differently, so I recommend playing around with all of them. In this case I used the NRC lexicon to look for positive and negative words in our book/rating categories. I simply filtered for words with positive meanings and looked at the top 20 most common such words in 5 star ratings, then did the same for negative words and 1 star reviews. The results are the below wordclouds, which by the way also exclude the generic terms from the dictionary I made earlier.
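A minimal sketch of this step, assuming tidytext’s get_sentiments() (the NRC lexicon is downloaded via the textdata package on first use) and the wordcloud package:

```r
library(dplyr)
library(tidytext)
library(wordcloud)

nrc <- get_sentiments("nrc")

# Top 20 positive words in 5 star reviews (generic terms were already removed)
positive_5star <- review_words %>%
  filter(rating == 5) %>%
  inner_join(filter(nrc, sentiment == "positive"), by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20)

wordcloud(words = positive_5star$word, freq = positive_5star$n)

# Same idea for negative words in 1 star reviews
negative_1star <- review_words %>%
  filter(rating == 1) %>%
  inner_join(filter(nrc, sentiment == "negative"), by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 20)

wordcloud(words = negative_1star$word, freq = negative_1star$n)
```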

While this also surfaces generic positive and negative words related to the topics, such as war for history and murder for mystery, which probably are not related to the actual ratings, we also see a few hints about user preferences. Confuse and bore are the main such words in negative reviews for both categories, while learn and enjoy are the main positive ones. It is interesting how surprise comes up under history rather than mystery, but other than that we don’t see much difference in sentiment between the categories. So let’s see if we can find other types of groups with the last technique.

Topic modelling is a huge topic in NLP circles, and here I just use a very basic application that showcases the power of this area. I mentioned in the beginning that books are put onto shelves by users, and these shelves often refer to subgenres made up by the users. So I wanted to see if we can accurately separate the books into such categories just from the content of the reviews left under them. This is not a trivial task at all: we saw that reviews differ in some ways but are similar in others across genres, so what results can we get?

First, we have to return to our tokenized text and do more groupings.
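The modelling code itself is not shown here; a plausible sketch, assuming LDA from the topicmodels package with 4 topics (consistent with the four topics discussed below), could look like this:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Count words per review (document), then cast into a document-term matrix
review_dtm <- review_words %>%
  count(review_id, word) %>%
  cast_dtm(document = review_id, term = word, value = n)

# Fit an LDA model with 4 topics
lda_model <- LDA(review_dtm, k = 4, control = list(seed = 42))

# Per-topic word probabilities ("beta"): the most impactful words per topic
top_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()
```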

Well, first we can look at the words that were the most impactful in determining the topics. Already we can see that mystery and history each account for two topics, which is a good sign, although many of the generic words we saw earlier in the dictionaries and sentiments come up across categories. Still, from these lists we could likely come up with our own topic labels:

So let’s see how well we did. I validate the results by getting the per-document topic probabilities (the likelihood of a document belonging to each topic), selecting the 10 documents most likely to belong to each topic, then joining this table to the books table. From there I get the list of shelves each book is put on, select the 10 most common ones and look at the results.
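A sketch of that validation, assuming tidytext’s tidied LDA output (where the per-document topic probability is called gamma); the books table and its book_id and popular_shelves columns are placeholders for the combined books data loaded earlier.

```r
library(dplyr)
library(tidytext)

# Per-document topic probabilities; tidytext calls this matrix "gamma"
doc_topics <- tidy(lda_model, matrix = "gamma")

# The 10 reviews most strongly associated with each topic
top_docs <- doc_topics %>%
  group_by(topic) %>%
  slice_max(gamma, n = 10) %>%
  ungroup()

# Join back to the reviews and books to recover each book's shelves,
# then count the most common shelves per topic
top_shelves <- top_docs %>%
  inner_join(reviews_en, by = c("document" = "review_id")) %>%
  inner_join(books, by = "book_id") %>%
  tidyr::unnest(popular_shelves) %>%
  count(topic, name, sort = TRUE) %>%  # "name" assumed to hold the shelf name
  group_by(topic) %>%
  slice_max(n, n = 10)
```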

If we ignore the common shelves used by most users (to-read, reading) we see mixed success in our assumptions. It seems we swapped topic 1, which actually refers to classical mystery books that are likely lighter in language, and topic 2, which is the more fictional side of historical writing.

However, topics 3 and 4 are pretty much what we expected.

I was also very impressed at how well the model separated not just mystery and history books but even found the fictional vs. non-fictional and dark vs. light mystery subgenres. Keep in mind it had no data about the actual contents of the books, just the reviews left by users, proving how useful the insights from this technique can be.

In conclusion we saw 3 common techniques for text analysis.

Using measures of length, complexity and word occurrence, we learned that history and 5 star reviews are usually lengthier and perhaps better worded than the reviews of 1 star and entertainment-focused books. In terms of the most common words, we didn’t see much difference, but excluding generic topic-based words we got slightly more interesting results.

Using sentiment analysis we learned that reviewers generally have the same preferences for exciting but useful reading experiences.

Lastly, with topic modelling we were able to separate subgenres of books just based on their reviews with surprising accuracy.

I hope you found this journey as interesting as I did. Should you have any questions about the details of the project, let me know via a.bognar93@gmail.com.
