Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
Creating a Stopwords List
Description: This notebook explains what a stopwords list is and how to create one. The following processes are described:
Loading the NLTK stopwords list
Modifying the stopwords list in Python
Saving a stopwords list to a .csv file
Loading a stopwords list from a .csv file
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Intermediate
Completion time: 20 minutes
Knowledge Required:
Python Basics Series (Start Python Basics I)
Knowledge Recommended: None
Data Format: CSV files
Libraries Used:
nltk to create an initial stopwords list
csv to read and write the stopwords to a file
Research Pipeline: None
The Purpose of a Stopwords List
Many text analytics techniques are based on counting the occurrence of words in a given text or set of texts (called a corpus). The most frequent words can reveal general textual patterns, but the most frequent words for any given text in English tend to look very similar to this:
| Word | Frequency |
|---|---|
| the | 1,160,276 |
| of | 906,898 |
| and | 682,419 |
| in | 461,328 |
| to | 418,017 |
| a | 334,082 |
| is | 214,663 |
| that | 204,277 |
| by | 181,605 |
| as | 177,774 |
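A frequency table like this one can be produced with a few lines of Python. The following is a minimal sketch, not the method used to generate the figures in the table: the sample sentence and the simple regular-expression tokenizer are assumptions made for illustration.

```python
from collections import Counter
import re

# A short sample text (hypothetical; a real corpus would be much larger)
text = "The cat sat on the mat, and the dog sat on the rug."

# Lowercase the text and pull out runs of letters/apostrophes as words
words = re.findall(r"[a-z']+", text.lower())

# Count how many times each word occurs
word_counts = Counter(words)

# Show the three most frequent words and their counts
for word, count in word_counts.most_common(3):
    print(word, count)
```

Even in this tiny example, the most frequent word is "the".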
Many of the most frequent words are function words: words like “the”, “in”, and “of” that are grammatically important but carry less semantic meaning than content words, such as nouns and verbs.
For this reason, many analysts remove common function words using a stopwords list. There is no official, standardized stopwords list for text analysis, but there are many sources to choose from. (We’ll use the Natural Language Toolkit stopwords list in this lesson.)
An effective stopwords list depends on:
the texts being analyzed
the purpose of the analysis
Even if we remove all common function words, there are often formulaic repetitions in texts that may be counter-productive for the research goal. The researcher is responsible for making educated decisions about whether or not to include any particular stopword given the research context.
Here are a few examples where additional stopwords may be necessary:
A corpus of law books is likely to have formulaic, archaic repetition, such as, “hereunto this law is enacted…”
A corpus of dramatic plays is likely to have speech markers for each line, leading to an over-representation of character names (Hamlet, Gertrude, Laertes, etc.)
A corpus of emails is likely to have header language (to, from, cc, bcc), technical language (attached, copied, thread, chain) and salutations (best, dear, cheers, etc.)
Because every research project may require unique stopwords, it is important for researchers to learn to create and modify stopwords lists.
Examining the NLTK Stopwords List
The Natural Language Toolkit Stopwords list is well-known and a natural starting point for creating your own list. Let’s take a look at what it contains before learning to make our own modifications.
We will store our stopwords in a Python list variable called `stop_words`.
# Create a stop_words list from NLTK. We could also use the set of stopwords from spaCy or Gensim.
import nltk
nltk.download('stopwords') # Download the stopwords corpus if it is not already installed
from nltk.corpus import stopwords # Import stopwords from nltk.corpus
stop_words = stopwords.words('english') # Create a list `stop_words` that contains the English stop words list
If you’re curious what is in our stopwords list, we can use the `print()` or `list()` functions to find out.
list(stop_words) # Show each string in our stopwords list
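Once the stopwords are in a Python list, we can modify them with ordinary list operations. The sketch below uses a short stand-in list so that it runs without NLTK; the custom words (character names from a corpus of plays) and the decision to keep "not" are hypothetical choices for illustration.

```python
# A short stand-in for the full stopwords list (assumption for this example)
stop_words = ["i", "me", "my", "the", "and", "not"]

# Words we want to add for our project (hypothetical examples)
custom_words = ["hamlet", "gertrude", "laertes"]

# Add each custom word, skipping any that are already in the list
for word in custom_words:
    if word not in stop_words:
        stop_words.append(word)

# Remove a word we want to keep in our analysis
if "not" in stop_words:
    stop_words.remove("not")

print(stop_words)
```

Checking for membership before calling `append()` or `remove()` avoids duplicate entries and prevents an error when the word to remove is not in the list.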
Alternative Stopwords Lists: spaCy and Gensim
Alternatively, you could load the stopwords list from spaCy or Gensim.
# Install spaCy
!pip install spacy
# Download the trained spaCy English pipeline
!python -m spacy download en_core_web_sm
# Load the spaCy English stopwords list
import spacy
sp = spacy.load('en_core_web_sm')
stop_words = list(sp.Defaults.stop_words) # spaCy stores its stopwords in a set; convert it to a list
# Create a stopwords list from the Gensim frozen set.
import gensim
from gensim.parsing.preprocessing import STOPWORDS
stop_words = list(STOPWORDS)
Storing Stopwords in a CSV File
Storing the stopwords list in a variable like `stop_words` is useful for analysis, but we will likely want to keep the list even after the session is over for future changes and analyses. We can store our stopwords list in a CSV file. A CSV, or “Comma-Separated Values” file, is a plain-text file with commas separating each entry. The file can be opened and modified with a text editor or spreadsheet software such as Excel or Google Sheets.
Here’s what our NLTK stopwords list will look like as a CSV file opened in a plain text editor.
Let’s create an example CSV using the `csv` module.
# Create a CSV file to store a set of stopwords
import csv # Import the csv module to work with csv files
with open('../data/stop_words.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(stop_words) # Write all of the stopwords as a single comma-separated row
We have created a new file called data/stop_words.csv that you can open and modify using a basic text editor. If you want to edit the CSV right inside JupyterLab, right-click on the file and select “Open With > Editor.”
Now go ahead and make a change to your data/stop_words.csv, either adding or removing words. Remember a few things:
Each word is separated from the next word by a comma.
There are no spaces between the words.
You must save changes to the file if you’re using a text editor, Excel, or the JupyterLab editor.
You can reopen the file to make sure your changes were saved.
Now let’s read our CSV file back and overwrite our original `stop_words` list variable.
Reading in a Stopwords CSV
# Open the CSV file and list the contents
with open('../data/stop_words.csv', 'r') as f:
    stop_words = f.read().strip().split(",")
stop_words[-10:] # Show the last ten stopwords
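A hand-edited file can easily pick up stray spaces, capital letters, or repeated words. As one possible safeguard, we could read the file with `csv.reader` and tidy each entry as it is loaded. This is a minimal sketch under stated assumptions: it writes its own small, deliberately messy file to a temporary folder rather than touching our real data/stop_words.csv, and the cleaning rules (strip spaces, lowercase, drop duplicates) are choices made for illustration.

```python
import csv
import os
import tempfile

# Create a small, messy example file in a temporary folder (hypothetical contents)
path = os.path.join(tempfile.mkdtemp(), "stop_words.csv")
with open(path, "w", newline="") as f:
    f.write("the, and ,Of,the\n")

# Read every row of the CSV file
with open(path, "r", newline="") as f:
    rows = list(csv.reader(f))

# Clean each word: strip spaces, lowercase, and skip blanks and duplicates
cleaned = []
for row in rows:
    for word in row:
        word = word.strip().lower()
        if word and word not in cleaned:
            cleaned.append(word)

print(cleaned)
```

Lowercasing only makes sense if the texts being analyzed are lowercased too, so treat that step as optional.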
Refining a stopwords list for your analysis can take time. It depends on:
What you are hoping to discover (for example, are function words important?)
The material you are analyzing (for example, journal articles may repeat words like “abstract”)
If your results are not satisfactory, you can always come back and adjust the stopwords. You may need to run your analysis many times to arrive at a good stopwords list.
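One way to approach this refinement is to look at the most frequent words that remain after filtering, add any that are not useful, and count again. The sketch below illustrates that loop on a tiny example; the token list, the starting stopwords, and the helper function `top_words` are all hypothetical.

```python
from collections import Counter

# A tiny tokenized "corpus" (hypothetical example)
tokens = "abstract the study of the text abstract the results of the study".split()

# A starting set of stopwords (sets make membership checks fast)
stop_words = {"the", "of"}

def top_words(tokens, stop_words, n=3):
    """Return the n most frequent tokens that are not stopwords."""
    kept = [token for token in tokens if token not in stop_words]
    return Counter(kept).most_common(n)

# First pass: "abstract" appears among the top words
first_pass = top_words(tokens, stop_words)
print(first_pass)

# Add the unhelpful word to the stopwords and count again
stop_words.add("abstract")
second_pass = top_words(tokens, stop_words)
print(second_pass)
```

Each pass shows which words now dominate the counts, which helps decide whether another round of changes is needed.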