Natural Language Processing (NLP) for Yelp Reviews

Nov 11, 2017
3 min read

Natural Language Processing (NLP) is widely used technique that is about the interactions between human languages and computer. Voice recognition, natural language generation/understanding are subcategories of NLP. NLP requires a lot of preprocessing in prior to the analysis, since most of the real-world text data are not well-written. Typos, slangs, and coined words are abundant or duplicated data to computer, and it cannot consider the same words in different forms as the same words. Native speakers (humans) need to clean out the text data for computers to help learning the natural languages.

In this post, I will show you how to preprocess on the text data and extract some information from the data. I used yelp reviews which is available from http://www.yelp.com/dataset. I developed topic modeling from reviews on the businesses on Las Vegas Strip, and built unsupervised models to cluster data with topic vectors.

1. Import json files

Let's read json files into pandas DataFrame.

We are almost done with setting up our data. Let's merge reviews for each business, thus we can analyze the reviews for each.

2. Text cleaning for NLP

Data need to be cleaned by removing punctuations, numbers and only keep alphabets. All alphabets need to be lower cased, and you can do more cleaning depends on your data.

Natural Language Toolkit (NLTK) has built-in tags of parts of speech for words. A list of tags can be found here. In this project, I will only keep nouns. NLTK returns a string as a list of words, thus you need to join items in list into a string for future analysis. Also I will remove some common words such as people, vegas, strip, etc.

3. CountVectorizer and LDA

Now all preprocessing is done and we will analyze review texts. We need to convert strings into term-document matrix which is a sparse matrix that will be feature reduced as a topic modeling.

Rows are terms (words) and columns are documents (reviews for each business). As you can see the terms, text data are messy and you can spend more time on cleaning them.

LDA will reduce the dimension of X matrix we just created above which is 120402 by 104 matrix; 120402 unique words in 104 documents. Topic modeling will reduce 120402 dimension to the number of topics we choose. A function will generate topic models. 'num_topics' is a number of topics we would like to generate, and 'passes', in other words 'iteration', represents that how many times to train on the data. Let's build 5 topics from the review data.

We got 5 topics here! I guess we can label the topics as 'Hotel', 'Restaurant / Club', 'Buffet / Streak house', 'Hotel', 'Show'. Let's compare the topics and the actual categories of businesses.

Do you think the 'topic 0' which is 'Hotel' is actually matching the business categories? You can play with the number of topics and the number of passes to improve your topic models.

4. PCA Clustering

We have reduced 120402 dimensions to 5 topics, and PCA can reduce the dimension further, thus we can plot on 2D space and intuitively analyze the 5 topics with colors. 'topic_vectors' are initially filtered by the value of 0.1 and we need to fill the empty elements with zeros.

PCA reduced the dimension to 2nd degree and all five topics have distinct clusters. The numbers in the plot represent the average rates for each topics, and we can assume that PC1 is related to satisfaction of reviewers for the businesses.

From the text data about reviews, we were able to extract distinct topics. The libraries we used (CountVectorizer and LDA) are simple NLP models, and you can try TF-IDF, word2vec, n-grams and NMF. Hope this post helps you start your own NLP projects!