Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License


Latent Dirichlet Allocation (LDA) Topic Modeling#

Description: This notebook demonstrates how to do topic modeling. The following processes are described:

  • Using the constellate client to retrieve a dataset

  • Filtering based on a pre-processed ID list

  • Filtering based on a stop words list

  • Cleaning the tokens in the dataset

  • Creating a gensim dictionary

  • Creating a gensim bag of words corpus

  • Computing a topic list using gensim

  • Visualizing the topic list with pyldavis

Knowledge Required:

Knowledge Recommended:


# Suppress Deprecation Warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Import modules and libraries
import constellate
from pathlib import Path
import gensim
from gensim.models import CoherenceModel
import pyLDAvis.gensim

What is Topic Modeling?#

Topic modeling is a machine learning technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.

Topic modeling is an unsupervised clustering technique for text: we give the machine a series of texts, and it attempts to cluster them into a given number of topics. There is also a supervised clustering technique called Topic Classification, where we supply the machine with examples of pre-labeled topics and then see whether it can identify similar texts based on those examples.

Topic modeling is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. Topic Classification, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it.
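
To make the idea concrete, here is a minimal, self-contained sketch (not part of the workflow that follows) that clusters a handful of toy documents into two topics with gensim. The tiny corpus, the topic count, and the random seed are illustrative assumptions only.

# A toy illustration of topic modeling (illustrative only; the real workflow begins below)
import gensim

toy_texts = [
    ['river', 'bank', 'water', 'flow', 'fish'],
    ['water', 'fish', 'river', 'stream'],
    ['loan', 'bank', 'money', 'interest'],
    ['money', 'interest', 'loan', 'credit'],
]

toy_dictionary = gensim.corpora.Dictionary(toy_texts)              # Map each word to an integer id
toy_corpus = [toy_dictionary.doc2bow(text) for text in toy_texts]  # Bag-of-words counts per document

toy_model = gensim.models.LdaModel(
    corpus=toy_corpus,
    id2word=toy_dictionary,
    num_topics=2,     # Ask for two topics (an arbitrary choice for this toy corpus)
    passes=10,
    random_state=42
)

# Each topic is a weighted grouping of words; with such a small corpus the split may be imperfect,
# but ideally one topic groups the "river" words and the other the "money" words
for topic_num, topic in toy_model.print_topics():
    print(topic_num, topic)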


Import your dataset#

The next code cell imports your dataset by trying each of the following methods in order:

  1. Search for a custom dataset in the data folder

  2. Download a full dataset that has been requested

  3. Download a sampled dataset (1500 items) that builds automatically when a dataset is created

If you are using a dataset ID, replace the default dataset ID in the next code cell.

If you don’t have a dataset ID, you can:

  • Use the sample dataset ID already in the code cell

  • Create a new dataset

  • Use a dataset ID from other pre-built sample datasets

The Constellate client will download datasets automatically using either the .download() or .get_dataset() method.

  • Full datasets are downloaded using the .download() method. They must be requested first in the builder environment.

  • Sampled datasets (1500 items) are downloaded using the .get_dataset() method. They are built automatically when a dataset is created.

We’ll use the constellate client library to automatically retrieve the dataset in the JSON Lines (jsonl) file format.

dataset_id = "8d0b285f-48cf-66fb-4836-6bac965d63cc"
# Check to see if a dataset file exists
# If not, download a dataset using the Constellate Client
# The default dataset is Independent Voices 1960-1990

# Independent Voices is an open access digital collection of alternative press newspapers, magazines and journals,
# drawn from the special collections of participating libraries. These periodicals were produced by feminists, 
# dissident GIs, campus radicals, Native Americans, anti-war activists, Black Power advocates, Hispanics, 
# LGBT activists, the extreme right-wing press and alternative literary magazines 
# during the latter half of the 20th century.

dataset_file = Path.cwd() / '..' /'data' / 'my_data.jsonl.gz' # Make sure this filepath matches your dataset filename

if not dataset_file.exists():
    try:
        dataset_file = constellate.download(dataset_id, 'jsonl')
    except Exception:
        dataset_file = constellate.get_dataset(dataset_id)

Load Stopwords List#

If you have created a stopword list in the stopwords notebook, we will import it here. (You can always modify the CSV file to add or remove words and then reload the list.) Otherwise, we’ll load the NLTK stopwords list automatically.

We recommend storing your stopwords in a CSV file as shown in the Creating Stopwords List notebook.
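
If you want to create that CSV directly from this notebook, the sketch below shows one way to do it. It is an assumption, not part of the original workflow: it writes the NLTK English stopwords plus a couple of placeholder extra words in the single-row layout that the loading cell below expects.

# A sketch for saving a custom stopword list to data/stop_words.csv (single row, one word per column)
# The extra words appended here are placeholders; replace them with terms relevant to your corpus
import csv
from pathlib import Path
from nltk.corpus import stopwords

custom_words = stopwords.words('english') + ['example', 'placeholder']

output_path = Path.cwd() / '..' / 'data' / 'stop_words.csv'
with output_path.open('w', newline='') as f:
    csv.writer(f).writerow(custom_words)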

# Load a custom data/stop_words.csv if available
# Otherwise, load the nltk stopwords list in English

# Create an empty Python list to hold the stopwords
stop_words = []

# The filename of the custom data/stop_words.csv file
stopwords_path = Path.cwd() / '..' /'data' / 'stop_words.csv'

if stopwords_path.exists():
    import csv
    with stopwords_path.open() as f:
        stop_words = list(csv.reader(f))[0]
    print('Custom stopwords list loaded from CSV')
else:
    # Load the NLTK stopwords list
    from nltk.corpus import stopwords
    stop_words = stopwords.words('english')
    print('NLTK stopwords list loaded')
# Preview stop words
print(stop_words)

Define a Function to Process Tokens#

Next, we create a short function to clean up our tokens.

def process_token(token):
    token = token.lower()
    if token in stop_words: # Remove stopwords
        return
    if len(token) < 4: # Remove short tokens
        return
    if not token.isalpha(): # Remove tokens that are not purely alphabetic
        return
    return token
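
To see what the cleaning rules do in practice, here is a quick check on a few sample tokens (the example tokens are arbitrary):

# Quick check of the cleaning rules (example tokens are arbitrary)
# 'The' becomes 'the', a stopword in the NLTK list; 'anti-war' and '1968' fail isalpha();
# the remaining tokens are lowercased and kept
for sample in ['The', 'Voices', 'anti-war', '1968', 'activists']:
    print(repr(sample), '->', repr(process_token(sample)))
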
%%time
# Limit to n documents. Set to None to use all documents.

limit = 5000

n = 0
documents = []
for document in constellate.dataset_reader(dataset_file):
    processed_document = []
    unigrams = document.get("unigramCount", {})
    for gram, count in unigrams.items():
        clean_gram = process_token(gram)
        if clean_gram is None:
            continue
        processed_document += [clean_gram] * count # Add the unigram as many times as it was counted
    if len(processed_document) > 0:
        documents.append(processed_document)
    if n % 1000 == 0:
        print(f'Unigrams collected for {n} documents...')
    n += 1
    if (limit is not None) and (n >= limit):
        break
print(f'All unigrams collected for {n} documents.')

Build a gensim dictionary and bag-of-words corpus, then train the model. More information about the parameters can be found at the Gensim LDA Model page.

# Build the gensim dictionary
dictionary = gensim.corpora.Dictionary(documents)
doc_count = len(documents)
num_topics = 7 # Change the number of topics
passes = 5 # The number of passes used to train the model
# Remove terms that appear in fewer than 50 documents and terms that occur in more than 90% of documents.
dictionary.filter_extremes(no_below=50, no_above=0.90)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]
bow_corpus[0]
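
The bag-of-words representation stores each document as (token id, count) pairs. Purely for inspection, something like the following (a small assumption, not part of the original notebook) translates a few of those ids back into words:

# Translate the first few (token id, count) pairs of the first document back into words
[(dictionary[token_id], count) for token_id, count in bow_corpus[0][:10]]
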
%%time
# Train the LDA model
model = gensim.models.LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=num_topics,
    passes=passes
)

Perplexity#

After each pass, the LDA model will output a “perplexity” score that measures the “held out log-likelihood”. Perplexity is a measure of how “surprised” the machine is to see certain data. In other words, perplexity measures how successfully a trained topic model predicts new data. The model may be trained many times with different parameters, optimizing for the lowest possible perplexity.

In general, the perplexity score should trend downward as the machine “learns” what to expect from the data. While a low perplexity score may signal the machine has learned the documents’ patterns, that does not mean that the topics formed from a model with low perplexity will form the most coherent topics. (See “Reading Tea Leaves: How Humans Interpret Topic Models” Chang, et al. 2009.)
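
If you want to check this number directly after training, rather than reading it from the log output, gensim's LdaModel exposes a log_perplexity() method. The snippet below is a small sketch of that check; evaluating on the training corpus itself is an assumption made for simplicity.

# Estimate perplexity on the training corpus (lower is generally better)
import numpy as np

per_word_bound = model.log_perplexity(bow_corpus)   # Per-word likelihood bound (log base 2)
print('Per-word bound:', per_word_bound)
print('Perplexity estimate:', np.exp2(-per_word_bound))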

Topic Coherence#

The failure of perplexity scores to consistently identify “good” topics has led to new measures of “topic coherence”. Here we demonstrate the UMass coherence measure with Gensim, but additional measures (such as c_v, sketched below) are available. Ideally, a researcher would run many topic models, discovering the settings that optimize topic coherence.

Ultimately, however, the best judgment of topic coherence is a disciplinary expert, particularly someone with familiarity with the materials in question.


# Compute the coherence score using UMass
# u_mass is measured from -14 to 14, higher is better
coherence_model_lda = CoherenceModel(
    model=model,
    corpus=bow_corpus,
    dictionary=dictionary, 
    coherence='u_mass'
)

# Compute Coherence Score using UMass
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
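
Another widely used measure is c_v, which works from the tokenized documents rather than the bag-of-words corpus and typically falls between 0 and 1 (higher is better). Below is a sketch of that check, reusing the documents list built earlier; it can be slow on large corpora.

# Compute the coherence score using c_v (uses the tokenized documents; roughly 0 to 1, higher is better)
coherence_model_cv = CoherenceModel(
    model=model,
    texts=documents,
    dictionary=dictionary,
    coherence='c_v'
)
print('c_v Coherence Score: ', coherence_model_cv.get_coherence())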

Display a List of Topics#

Print the most significant terms, as determined by the model, for each topic.

for topic_num in range(0, num_topics):
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        word = dictionary.id2token[wid]
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))

Visualize the Topic Distances#

Visualize the model using pyLDAvis. The visualization will be output to an html file in the data folder. (Right-click on the html file and choose “Open in New Browser Tab.”)

Try choosing a topic and adjusting the λ slider. When λ approaches 0, the terms shown for a topic are those that occur almost exclusively in that topic. When λ approaches 1, the terms are ranked by how frequently they occur within the topic overall, so they may also be common in other topics.

# Export this visualization as an HTML file
# An internet connection is still required to view the HTML
p = pyLDAvis.gensim.prepare(model, bow_corpus, dictionary)
pyLDAvis.save_html(p, '../data/my_visualization.html')
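
If you would rather view the visualization directly in the notebook instead of opening the saved HTML file, pyLDAvis can also render it inline. This is optional and assumes you are running in a Jupyter environment.

# Optional: display the visualization inline in the notebook
pyLDAvis.enable_notebook()
pyLDAvis.display(p)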