
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Exploring Word Frequencies#

Description: This notebook shows how to find the most common words in a dataset. The following processes are described:

  • Using the constellate client to download and read a dataset

  • Filtering based on a pre-processed ID list

  • Filtering based on a stop words list

  • Using a Counter() object to get the most common words

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion time: 60 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)

Libraries Used:

  • constellate client to collect, unzip, and read our dataset

  • NLTK to help clean up our dataset

  • Counter from the collections module to help sum up our word frequencies

Research Pipeline:

  1. Build a dataset

  2. Create a “Pre-Processing CSV” with Exploring Metadata (Optional)

  3. Create a “Custom Stopwords List” with Creating a Stopwords List (Optional)

  4. Complete the word frequencies analysis with this notebook


# Import modules and libraries
import constellate
import pandas as pd
from pathlib import Path
import csv

# For making wordclouds
from wordcloud import WordCloud
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from PIL import Image
import urllib.request

Import your dataset#

The next code cell imports your dataset by trying each of the following methods in order:

  1. Search for a custom dataset in the data folder

  2. Download a full dataset that has been requested

  3. Download a sampled dataset (1500 items) that builds automatically when a dataset is created

If you are using a dataset ID, replace the default dataset ID in the next code cell.

If you don’t have a dataset ID, you can:

  • Use the sample dataset ID already in the code cell

  • Create a new dataset

  • Use a dataset ID from other pre-built sample datasets

The Constellate client will download datasets automatically using either the .download() or .get_dataset() method.

  • Full datasets are downloaded using the .download() method. They must be requested first in the builder environment.

  • Sampled datasets (1500 items) are downloaded using the .get_dataset() method. They are built automatically when a dataset is created.

dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Check to see if a dataset file exists
# If not, download a dataset using the Constellate Client
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_file = Path.cwd() / '..' / 'data' / 'my_data.jsonl.gz' # Make sure this filepath matches your dataset filename

if not dataset_file.exists():
    try:
        # Try to download the requested full dataset first
        dataset_file = constellate.download(dataset_id, 'jsonl')
    except Exception:
        # Fall back to the automatically built sampled dataset (1500 items)
        dataset_file = constellate.get_dataset(dataset_id)

Extract Unigram Counts from the JSON file (No cleaning)#

The dataset file is a compressed JSON Lines file (.jsonl.gz). It contains all of the metadata found in the metadata CSV, plus the textual data needed for analysis, including (a quick way to peek at one document's fields is sketched just after this list):

  • Unigram Counts

  • Bigram Counts

  • Trigram Counts

  • Full-text (if available)
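
Before extracting anything, it can be helpful to peek at a single document to see which of these fields it actually contains. The cell below is a small, optional sketch that reads just the first document with the same .dataset_reader() method used throughout this notebook.

# Peek at the first document to see which fields are available
for first_document in constellate.dataset_reader(dataset_file):
    print(list(first_document.keys()))
    # Preview a few of its unigram counts (if present)
    print(list(first_document.get('unigramCount', {}).items())[:10])
    break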

To complete our analysis, we are going to pull out the unigram counts for each document and store them in a Counter() object. First, we import Counter from the collections module, which gives us Counter() objects for counting unigrams. Then we initialize an empty Counter() object, word_frequency, to hold all of our unigram counts.

# Import Counter()
from collections import Counter

# Create an empty Counter object called `word_frequency`
word_frequency = Counter()
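
If Counter() objects are new to you, the toy example below (purely illustrative, separate from our analysis) shows how they work: missing keys default to zero, so counts can be added with += without any initialization, and .most_common() returns the items sorted by count.

# A toy example of how a Counter tallies values
toy_counts = Counter()
toy_counts['the'] += 2   # Unseen keys start at zero, so this sets 'the' to 2
toy_counts['word'] += 1
toy_counts['the'] += 3   # Counts accumulate across additions
print(toy_counts.most_common())  # [('the', 5), ('word', 1)]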

We can read in each document using the .dataset_reader() method. This method decompresses the dataset file and yields each document's data one by one.

# Gather unigramCounts from documents
i = 0
for document in constellate.dataset_reader(dataset_file):
    unigrams = document.get("unigramCount", {})  # Default to an empty dict if a document has no unigrams
    for gram, count in unigrams.items():
        word_frequency[gram] += count
    i += 1

# Print success message
print(f'The unigrams from {i} documents were collected.')

Find Most Common Unigrams#

Now that we have the frequencies of all the unigrams in our corpus, we can sort them with the .most_common() method to find which occur most often.

for gram, count in word_frequency.most_common(25):
    print(gram.ljust(20), count)

Some issues to consider#

We have successfully created a word frequency list. There are a couple of small issues, however, that we still need to address:

  1. There are many function words, words like “the”, “in”, and “of”, that are grammatically important but do not carry as much semantic meaning as content words, such as nouns and verbs.

  2. The words represented here are actually case-sensitive strings. That means the string “the” is different from the string “The”. You may notice this in your results above.

Extract Unigram Counts from the JSON File (with cleaning)#

To address these issues, we need to remove common function words and merge strings that differ only in capitalization. We can do this by:

  1. Using a stopwords list to remove common function words

  2. Lowercasing all the characters in each string to combine our counts

Load Stopwords List#

If you created a stopwords list in the Creating a Stopwords List notebook, it will be imported here. (You can always modify the CSV file to add or remove words, then reload the list.) Otherwise, the NLTK stopwords list will be loaded automatically.

We recommend storing your stopwords in a CSV file, as shown in the Creating a Stopwords List notebook.

# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English

# Create an empty Python list to hold the stopwords
stop_words = []

# The filename of the custom data/stop_words.csv file
stopwords_path = Path.cwd() / '..' / 'data' / 'stop_words.csv'

if stopwords_path.exists():
    # Read the stopwords from the first row of the CSV file
    with stopwords_path.open() as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')
# Preview stop words
print(stop_words)
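
If you would rather adjust the stopwords directly in the notebook instead of editing the CSV file, you can append or remove words here. The words below are hypothetical examples, not recommendations for your corpus.

# Add hypothetical example words to the stopwords list
stop_words += ['also', 'would', 'thus']

# Remove a word if you decide it should be kept in the analysis
if 'thus' in stop_words:
    stop_words.remove('thus')

print(f'The stopwords list now contains {len(stop_words)} words.')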

Gather unigrams again with extra cleaning steps#

In addition to using a stopwords list, we will lowercase every token so that counts for tokens that differ only in capitalization, such as “quarterly” and “Quarterly”, are combined. We will also discard any token that contains non-alphabetic characters or is shorter than four characters.
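
The Research Pipeline above also mentions filtering on a pre-processed ID list created with the Exploring Metadata notebook. The cell below is a minimal, optional sketch of loading such a list; the filename pre-processed_ids.csv and the id column are assumptions, so adjust them to match whatever your pre-processing step produced. If no file is found, the list stays empty and every document is included in the next cell.

# Load a pre-processed ID list if one is available (optional)
# NOTE: the filename and the 'id' column are assumptions; adjust them to match your pre-processing CSV
filtered_id_list = []

pre_processed_path = Path.cwd() / '..' / 'data' / 'pre-processed_ids.csv'

if pre_processed_path.exists():
    with pre_processed_path.open() as f:
        for row in csv.DictReader(f):
            filtered_id_list.append(row['id'])
    print(f'Pre-processed ID list loaded ({len(filtered_id_list)} document IDs).')
else:
    print('No pre-processed ID list found. All documents will be included.')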

# Gather unigramCounts from documents, keeping only documents in
# `filtered_id_list` (if one was loaded) and applying the cleaning steps

word_frequency = Counter()

for document in constellate.dataset_reader(dataset_file):
    # Skip documents that are not in the pre-processed ID list (if one was loaded)
    if len(filtered_id_list) > 0 and document.get('id') not in filtered_id_list:
        continue
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = gram.lower()  # Lowercase the token
        if clean_gram in stop_words:  # Skip stopwords
            continue
        if not clean_gram.isalpha():  # Skip tokens with non-alphabetic characters
            continue
        if len(clean_gram) < 4:  # Skip tokens shorter than four characters
            continue
        word_frequency[clean_gram] += count

Display Results#

Finally, we will display the 25 most common words by using the .most_common() method on the Counter() object.

# Print the most common processed unigrams and their counts
for gram, count in word_frequency.most_common(25):
    print(gram.ljust(20), count)

Export Results to a CSV File#

The word frequency data can be exported to a CSV file.

# Write the word frequencies to a CSV file

csv_file = Path.cwd() / '..' / 'data' / 'word_counts.csv'

with csv_file.open('w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['unigram', 'count'])
    for gram, count in word_frequency.most_common():
        writer.writerow([gram, count])
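
Since pandas was imported at the top of this notebook, the word frequencies can also be placed in a DataFrame for sorting, filtering, or joining with metadata. This is a minimal sketch; the variable name word_freq_df is just an example.

# Build a DataFrame of unigrams and counts from the Counter object
word_freq_df = pd.DataFrame(word_frequency.most_common(), columns=['unigram', 'count'])

# Preview the most frequent unigrams
word_freq_df.head(10)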

Create a Word Cloud to Visualize the Data#

We can visualize the most frequent words using the WordCloud library in Python. To learn more about customizing a word cloud, see the WordCloud documentation.

### Download cloud image for our word cloud shape ###
# It is not required to have a shape to create a word cloud
download_url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/sample_cloud.png'
urllib.request.urlretrieve(download_url, '../data/sample_cloud.png')
print('Cloud shape downloaded.')
# Create a wordcloud from our data

# Adding a mask shape of a cloud to your word cloud
# By default, the shape will be a rectangle
# You can specify any shape you like based on an image file
cloud_mask = np.array(Image.open('../data/sample_cloud.png')) # Specifies the location of the mask shape
cloud_mask = np.where(cloud_mask > 3, 255, cloud_mask) # this line will take all values greater than 3 and make them 255 (white)

### Specify word cloud details
wordcloud = WordCloud(
    width = 800, # Change the pixel width of the image if blurry
    height = 600, # Change the pixel height of the image if blurry
    background_color = "white", # Change the background color
    colormap = 'viridis', # The colors of the words, see https://matplotlib.org/stable/tutorials/colors/colormaps.html
    max_words = 150, # Change the max number of words shown
    min_font_size = 4, # Do not show small text
    
    # Add a shape and outline (known as a mask) to your wordcloud
    contour_color = 'blue', # The outline color of your mask shape
    mask = cloud_mask, # Use the cloud image as the mask shape
    contour_width = 1
).generate_from_frequencies(word_frequency)

mpl.rcParams['figure.figsize'] = (20,20) # Change the image size displayed
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
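
If you would like to keep a copy of the image, the WordCloud object can write itself out with its .to_file() method. The output filename below is just an example.

# Save the word cloud image to the data folder (example filename)
wordcloud.to_file('../data/word_cloud.png')
print('Word cloud saved to ../data/word_cloud.png')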