The Basics of Natural Language Processing: A Big Bag of Words

Natural language processing is how we get computers to understand human language. At first glance, this seems rather daunting. Computers don’t think like you or me, so how could we possibly get them to read something like this blog post and understand the nuances of diction and syntax? Well, the basics of natural language processing (NLP) are not all that complicated, and in this blog post I endeavor to offer a high-level overview of this fascinating subject. Specifically, this post will cover the bag of words approach to NLP.

There are many ways we are already using NLP. If you have ever used Siri, Cortana, Alexa, or any of the other intelligent personal assistants, you have used NLP. If you have ever used Google Translate, you have used NLP. If you have ever used autocorrect on your phone or computer, you have used NLP. Those three categories should cover the vast majority of folks reading this blog post, which goes to show that NLP is a rapidly growing field. But a lot can also go wrong with NLP: Siri often gets confused, and sometimes the autocorrect on my phone is wildly off.

Let’s start with the basics. A big part of NLP (and a part I find has a large ROI in the final model) is preprocessing the text data. I tend to make everything lowercase and drop punctuation right off the bat. This allows the computer to see things like “Luke!” and “luke” as the same object. Another technique I employ is called stemming. Stemming reduces a word to its root (not the Latin type). For example, it would reduce the word “credits” to “credit” and “information” to “inform”. This also helps the computer recognize similar words. Lemmatization is another preprocessing step, which converts words like “geese” to “goose” and “cats” to “cat”. Both stemming and lemmatization reduce the number of features the computer creates when tokenizing or count vectorizing (which we will talk about shortly).

All of these processes can be implemented with the Python package Natural Language Toolkit (NLTK). I would highly recommend you take a look at the documentation for NLTK because it is an invaluable tool when it comes to NLP. Preprocessing is a very important step in NLP, and I often find myself reworking it. When you are working with large bodies of text (also known as corpora), the number of unique words can be very large, and you should use all of the tools in the toolkit to bring that number down.
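To make this concrete, here is a minimal sketch of that preprocessing with NLTK. The example sentence is made up, and the exact resources you need to download can vary with your NLTK version.

```python
import string

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")    # tokenizer models (newer NLTK versions may ask for "punkt_tab")
nltk.download("wordnet")  # dictionary used by the lemmatizer

text = "Luke! The geese credits information."

# Lowercase and strip punctuation so "Luke!" and "luke" become the same token.
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
tokens = nltk.word_tokenize(cleaned)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(t) for t in tokens])          # e.g. "credits" -> "credit", "information" -> "inform"
print([lemmatizer.lemmatize(t) for t in tokens])  # e.g. "geese" -> "goose"
```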

The NLP approach I want to talk about is the bag of words approach. You can build a bag of words model and turn a body of text into a set of features, upon which you can conduct various analyses. Tokenizing, counting, and normalizing accomplish this. The bag of words model is often used in text classification and sentiment analysis. We want to take a corpus (a body of text) and make sense of the words within it.

Scikit-learn, a popular machine learning package for Python, has three great vectorizers that can break a body of text down into features. Scikit-learn’s CountVectorizer returns a count of the number of occurrences of each word. If a sentence uses the word “monkey” once and “banana” three times, CountVectorizer will put a 1 under the “monkey” feature and a 3 under the “banana” feature.
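Here is a small sketch of what that looks like, assuming a recent version of scikit-learn; the one-sentence corpus is invented just to mirror the monkey/banana example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the monkey ate a banana, a banana, and another banana"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the vocabulary: one column per word
print(counts.toarray())                    # "banana" gets a 3, "monkey" gets a 1
```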

There are three parameters that are very helpful with this vectorizer: max_features, ngram_range, and stop_words. The max_features parameter lets you control the number of features generated from a corpus by capping how many are kept. Say you only want a dataset with 50 features: set max_features to 50 and CountVectorizer will keep only the 50 most frequent words in the corpus. The ngram_range parameter controls the grouping of words. An n-gram range that includes pairs would grab two-word phrases from the corpus like “data engineer” instead of only individual words like “engineer”. The stop_words parameter allows you to exclude words from a preset list. If you pass “english” into stop_words, the vectorizer will skip words common in the English language like “is”, “as”, “a”, and “the”. This is another form of preprocessing, and it helps reduce the number of features this method will produce.
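A quick sketch of those three parameters, on an invented two-document corpus, might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the data engineer cleaned the data",
    "a data scientist is not a data engineer",
]

vectorizer = CountVectorizer(
    max_features=50,       # keep at most the 50 most frequent terms
    ngram_range=(1, 2),    # single words plus two-word pairs like "data engineer"
    stop_words="english",  # drop common English words such as "is", "a", and "the"
)
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(counts.toarray())
```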

Scikit-learn also has a HashingVectorizer, which converts the text into a matrix of token occurrences using hashed feature indices. This transformation makes the features uninterpretable (you cannot map a column back to a word), and I rarely use this method because of that byproduct. TF-IDF is another method, and it stands for term frequency-inverse document frequency. This score tells us how important a word is within a corpus: a word’s frequency in a document is weighted down by how many of the other documents also contain it. This method is good at finding words that discriminate one document from another. For example, if the word “alien” pops up in a specific document and not in others, TF-IDF will be great at picking up on this. The word “alien” will help the computer differentiate documents better than a word like “the”.
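A rough sketch of TF-IDF with scikit-learn’s TfidfVectorizer, on a few made-up documents, looks something like this; the point is just that a word unique to one document gets more weight there than words shared by every document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the alien landed near our ship",
    "our crew boarded the ship",
    "our ship left the dock",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
# In the first document, "alien" is weighted more heavily than "the", "our",
# or "ship", which appear in every document.
print(weights.toarray().round(2))
```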

After you have preprocessed the text data and used CountVectorizer or TF-IDF, you are ready to build an NLP model. At the simplest level, this can be a logistic regression or a k-nearest neighbors classifier. One basic application of the bag of words version of NLP is sentiment analysis. In General Assembly’s Data Science Immersive course, I analyzed book review sentiment. The sentiment was either positive or negative (encoded as a binary target). I preprocessed the data, used TF-IDF, and set my target variable to the sentiment. With a KNN classifier, I was able to accurately predict the sentiment of a book review 96% of the time. The model easily handled clear praise and clear criticism, but it found it difficult to classify reviews that contained both positive and negative language. For example, the model struggled with “I’m sorry I hate to read Harry Potter, but I loved the latest book!”
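For illustration, here is a rough sketch of that workflow with scikit-learn. The handful of reviews and labels below are invented, and the real project used an actual book review dataset and more tuning than shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = positive sentiment, 0 = negative sentiment.
reviews = [
    "i loved this book, a wonderful read",
    "terrible pacing, i hated every chapter",
    "a brilliant story with great characters",
    "boring and badly written, do not buy",
]
sentiment = [1, 0, 1, 0]

# TF-IDF features feeding a k-nearest neighbors classifier.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KNeighborsClassifier(n_neighbors=3),
)
model.fit(reviews, sentiment)

print(model.predict(["what a great book"]))
```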

Overall, the bag of words approach is a good introduction to NLP. It takes common concepts from machine learning like feature engineering and model building and tweaks them only slightly. There are more advanced methods of NLP, and I plan on going over them in the future. But the main thing I wanted to showcase in this blog post is that NLP is really cool and not difficult to implement on your own computer at home. So go ahead and download NLTK!
