Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Tokenize Text Files with NLTK#

Description: This notebook takes as input:

  • Plain text files (.txt) in a zipped folder called ‘texts’ in the data folder

  • Metadata CSV file called ‘metadata.csv’ in the data folder (optional)

and outputs a single JSONL file containing the unigrams, bigrams, trigrams, full text, and metadata for each text.

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Advanced

Completion time: 10-15 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: .txt, .csv, .jsonl

Libraries Used:

  • os

  • json

  • NLTK

  • gzip

  • nltk.corpus

  • collections

  • pandas

Research Pipeline:

  1. Scan documents

  2. OCR files

  3. Clean up texts

  4. Tokenize text files (this notebook)


Data Inputs#

Texts (.txt)#

All the texts should be in plaintext format. The filenames may be used for reference, so give them descriptive names that will help you identify them for your analysis. Additional data about each text can be supplied in an optional CSV file described below.

Place them in a folder called ‘texts’, then zip that folder into a single file called ‘texts.zip’ and put it in the data folder.
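
If you prefer to create the zip file programmatically, here is a minimal sketch using Python’s standard library. It assumes your .txt files are already in ../data/texts and, like the sample data used below, places the text files at the top level of the archive:

import shutil

# Zip the contents of ../data/texts into ../data/texts.zip
# so the .txt files sit at the top level of the archive,
# matching the layout of the sample data downloaded below
shutil.make_archive('../data/texts', 'zip', root_dir='../data/texts')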

Metadata (.csv) (Optional)#

A CSV file containing metadata may be included for analysis. For specifications, see the list of fields below.

The fields may include the following:

  • id: a unique item ID (in JSTOR, this is a stable URL)

  • title: the title for the document

  • isPartOf: the larger work that holds this title (for example, a journal title)

  • publicationYear: the year of publication

  • doi: the digital object identifier

  • docType: the type of document (for example, article or book)

  • provider: the source or provider of the dataset

  • datePublished: the publication date in yyyy-mm-dd format

  • issueNumber: the issue number for a journal publication

  • volumeNumber: the volume number for a journal publication

  • url: a URL for the item and/or the item’s metadata

  • creator: the author or authors of the item

  • language: the language or languages of the item (eng is the ISO 639 code for English)

  • pageStart: the first page number of the print version

  • pageEnd: the last page number of the print version

  • placeOfPublication: the city of the publisher

  • pageCount: the number of print pages in the item

  • wordCount: the number of words in the item

  • pagination: the page sequence in the print version

  • publisher: the publisher for the item

  • abstract: the abstract description for the document

  • outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)
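
If you build your own metadata.csv, each id must match the identifier this notebook derives from the filename (the filename minus its .txt extension). Here is a minimal sketch using pandas; the titles and years are illustrative placeholders. Note that update_metadata_from_csv() below reads every field it lists, so either include all of those columns or comment out the unused fields in that function.

import pandas as pd

# Hypothetical rows; 'id' must equal the text's filename minus '.txt'
metadata = pd.DataFrame([
    {'id': 'hamlet_TXT_FolgerShakespeare', 'title': 'Hamlet',
     'publicationYear': 1603, 'docType': 'book'},
    {'id': 'macbeth_TXT_FolgerShakespeare', 'title': 'Macbeth',
     'publicationYear': 1623, 'docType': 'book'},
])
metadata.to_csv('../data/metadata.csv', index=False)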

# Download sample data
# Shakespeare's Plays from The Folger Shakespeare
# https://shakespeare.folger.edu/download-the-folger-shakespeare-complete-set/
import urllib.request
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

zipfile_address = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/data/texts.zip'
urllib.request.urlretrieve(zipfile_address, '../data/texts.zip')
('../data/texts.zip', <http.client.HTTPMessage at 0x104323590>)

Import Libraries#

import os
import json
import gzip
import zipfile
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import PlaintextCorpusReader

Define Functions#

### Various functions written for this notebook ###

def convert_tuple_bigrams(tuples_to_convert):
    """Converts NLTK tuples into bigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        gram_string = f'{first_word} {second_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_tuple_trigrams(tuples_to_convert):
    """Converts NLTK tuples into trigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        third_word = tuple_grams[2]
        gram_string = f'{first_word} {second_word} {third_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_strings_to_counts(string_grams):
    """Converts a Counter of n-grams into a dictionary"""
    counter_of_grams = Counter(string_grams)
    dict_of_grams = dict(counter_of_grams)
    return dict_of_grams

def update_metadata_from_csv():
    """Uses pandas to grab additional metadata fields from a CSV file then adds them to the JSON-L file.
    Unused fields can be commented out."""
    title = df.loc[identifier, 'title']
    isPartOf = df.loc[identifier, 'isPartOf']
    publicationYear = str(df.loc[identifier, 'publicationYear'])
    doi = df.loc[identifier, 'doi']
    docType = df.loc[identifier, 'docType']
    provider = df.loc[identifier, 'provider']
    datePublished = df.loc[identifier, 'datePublished']
    issueNumber = str(df.loc[identifier, 'issueNumber'])
    volumeNumber = str(df.loc[identifier, 'volumeNumber'])
    url = df.loc[identifier, 'url']
    creator = df.loc[identifier, 'creator']
    publisher = df.loc[identifier, 'publisher']
    language = df.loc[identifier, 'language']
    pageStart = df.loc[identifier, 'pageStart']
    pageEnd = df.loc[identifier, 'pageEnd']
    placeOfPublication = df.loc[identifier, 'placeOfPublication']
    pageCount = str(df.loc[identifier, 'pageCount'])

    data.update([   
        ('title', title),
        ('isPartOf', isPartOf),
        ('publicationYear', publicationYear),
        ('doi', doi),
        ('docType', docType),
        ('provider', provider),
        ('datePublished', datePublished),
        ('issueNumber', issueNumber),
        ('volumeNumber', volumeNumber),
        ('url', url),
        ('creator', creator),
        ('publisher', publisher),
        ('language', language),
        ('pageStart', pageStart),
        ('pageEnd', pageEnd),
        ('placeOfPublication', placeOfPublication),
        ('pageCount', pageCount),
    ])
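
A quick sketch of what the n-gram helpers above produce, using a made-up word list (expected output shown in the comments):

sample_words = ['to', 'be', 'or', 'not', 'to', 'be']

# Pair up adjacent words, then join each pair into a string
string_bigrams = convert_tuple_bigrams(nltk.bigrams(sample_words))
print(string_bigrams)
# ['to be', 'be or', 'or not', 'not to', 'to be']

# Tally how often each bigram string occurs
print(convert_strings_to_counts(string_bigrams))
# {'to be': 2, 'be or': 1, 'or not': 1, 'not to': 1}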

Unzip Texts Folder (optional)#

### Extract Zip File of Texts ###
# The zip file should extract into a folder
# called 'texts'

# Alternatively, skip this unzipping code cell:
# create a folder called 'texts' in the 'data' folder
# and place your .txt files inside it

filename = '../data/texts.zip'

try:
    corpus_zip = zipfile.ZipFile(filename)
    corpus_zip.extractall('../data/texts/')
    corpus_zip.close()
    print('Zip file extracted successfully.')
except FileNotFoundError:
    print('No zip file detected. Upload your zip file to the data folder.')
Zip file extracted successfully.

Check for Metadata CSV (optional)#

### Check for a metadata CSV file ###

csv_filename = 'metadata.csv'

if os.path.exists(f'../data/{csv_filename}'):
    csv_exists = True
    print('Metadata CSV found.')
else: 
    csv_exists = False
    print('No metadata CSV found.')
No metadata CSV found.

Import the Text Files into NLTK#

### Establish root folder holding all text files ###
# Create corpus using all text files in corpus_root
# By default, PlaintextCorpusReader uses WordPunctTokenizer for words
# and the punkt tokenizer for sentences
# See https://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
from nltk.corpus import PlaintextCorpusReader

# Alternative word tokenizers that could be passed in instead
from nltk.tokenize import (TreebankWordTokenizer,
                           word_tokenize,
                           WordPunctTokenizer,
                           TweetTokenizer,
                           MWETokenizer)

corpus_root = '../data/texts'
corpus = PlaintextCorpusReader(corpus_root, '.*txt', word_tokenizer=WordPunctTokenizer())
### Print all File IDs in corpus based on text file names ###
text_list = corpus.fileids()
print('Corpus created from:')
list(text_list)
Corpus created from:
['a-midsummer-nights-dream_TXT_FolgerShakespeare.txt',
 'alls-well-that-ends-well_TXT_FolgerShakespeare.txt',
 'antony-and-cleopatra_TXT_FolgerShakespeare.txt',
 'as-you-like-it_TXT_FolgerShakespeare.txt',
 'coriolanus_TXT_FolgerShakespeare.txt',
 'cymbeline_TXT_FolgerShakespeare.txt',
 'hamlet_TXT_FolgerShakespeare.txt',
 'henry-iv-part-1_TXT_FolgerShakespeare.txt',
 'henry-iv-part-2_TXT_FolgerShakespeare.txt',
 'henry-v_TXT_FolgerShakespeare.txt',
 'henry-vi-part-1_TXT_FolgerShakespeare.txt',
 'henry-vi-part-2_TXT_FolgerShakespeare.txt',
 'henry-vi-part-3_TXT_FolgerShakespeare.txt',
 'henry-viii_TXT_FolgerShakespeare.txt',
 'julius-caesar_TXT_FolgerShakespeare.txt',
 'king-john_TXT_FolgerShakespeare.txt',
 'king-lear_TXT_FolgerShakespeare.txt',
 'loves-labors-lost_TXT_FolgerShakespeare.txt',
 'lucrece_TXT_FolgerShakespeare.txt',
 'macbeth_TXT_FolgerShakespeare.txt',
 'measure-for-measure_TXT_FolgerShakespeare.txt',
 'much-ado-about-nothing_TXT_FolgerShakespeare.txt',
 'othello_TXT_FolgerShakespeare.txt',
 'pericles_TXT_FolgerShakespeare.txt',
 'richard-ii_TXT_FolgerShakespeare.txt',
 'richard-iii_TXT_FolgerShakespeare.txt',
 'romeo-and-juliet_TXT_FolgerShakespeare.txt',
 'shakespeares-sonnets_TXT_FolgerShakespeare.txt',
 'the-comedy-of-errors_TXT_FolgerShakespeare.txt',
 'the-merchant-of-venice_TXT_FolgerShakespeare.txt',
 'the-merry-wives-of-windsor_TXT_FolgerShakespeare.txt',
 'the-phoenix-and-turtle_TXT_FolgerShakespeare.txt',
 'the-taming-of-the-shrew_TXT_FolgerShakespeare.txt',
 'the-tempest_TXT_FolgerShakespeare.txt',
 'the-two-gentlemen-of-verona_TXT_FolgerShakespeare.txt',
 'the-two-noble-kinsmen_TXT_FolgerShakespeare.txt',
 'the-winters-tale_TXT_FolgerShakespeare.txt',
 'timon-of-athens_TXT_FolgerShakespeare.txt',
 'titus-andronicus_TXT_FolgerShakespeare.txt',
 'troilus-and-cressida_TXT_FolgerShakespeare.txt',
 'twelfth-night_TXT_FolgerShakespeare.txt',
 'venus-and-adonis_TXT_FolgerShakespeare.txt']
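
The choice of word tokenizer changes the tokens, and therefore every n-gram count downstream. A quick sketch comparing two of the tokenizers imported above on a line with a contraction (expected output in comments):

sample = "What's in a name?"

# WordPunctTokenizer splits on every punctuation boundary
print(WordPunctTokenizer().tokenize(sample))
# ['What', "'", 's', 'in', 'a', 'name', '?']

# TreebankWordTokenizer keeps contractions as clitics like "'s"
print(TreebankWordTokenizer().tokenize(sample))
# ['What', "'s", 'in', 'a', 'name', '?']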

Generate and Output Data to JSONL File#

If an old JSONL file already exists, this process will overwrite it.

For each text, this code will:

  1. Gather unigrams, bigrams, trigrams, and full text

  2. Compute word counts

  3. Check for additional metadata in a CSV file

  4. Write the data to the JSONL file (a sketch of one record’s shape follows this list)
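
For reference, each line of the output file is a single JSON object shaped roughly like this (a sketch with illustrative values, not real data):

# Shape of one JSONL record (illustrative values only)
record = {
    'id': 'hamlet_TXT_FolgerShakespeare',
    'title': 'hamlet_TXT_FolgerShakespeare',
    'outputFormat': ['unigram', 'bigram', 'trigram', 'fullText'],
    'wordCount': 6,
    'fullText': 'To be or not to be',
    'unigramCount': {'To': 1, 'be': 2, 'or': 1, 'not': 1, 'to': 1},
    'bigramCount': {'To be': 1, 'be or': 1, 'or not': 1, 'not to': 1, 'to be': 1},
    'trigramCount': {'To be or': 1, 'be or not': 1, 'or not to': 1, 'not to be': 1}
}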

### Create the JSON-L file and gzip it ###

# For every text: 
# 1. Compute unigrams, bigrams, trigrams, and wordCount
# 2. Append the data to a JSON-L file
# After all data is written, compress the dataset using gzip
## **If the JSONL file exists, it will be overwritten**

# Define the file output name
output_filename = 'my_data.jsonl'

# Delete output files if they already exist
if os.path.exists(f'../data/{output_filename}'):
    os.remove(f'../data/{output_filename}')
    print(f'Overwriting old version of {output_filename}')

if os.path.exists(f'../data/{output_filename}.gz'):
    os.remove(f'../data/{output_filename}.gz')
    print(f'Overwriting old version of {output_filename}.gz\n')
                  

for text in text_list:
    
    # Create identifier from filename
    identifier = text[:-4]
    
    # Compute unigrams
    unigrams = corpus.words(text)
    unigramCount = convert_strings_to_counts(unigrams)
    
    # Compute bigrams
    tuple_bigrams = list(nltk.bigrams(unigrams))
    string_bigrams = convert_tuple_bigrams(tuple_bigrams)
    bigramCount = convert_strings_to_counts(string_bigrams)
    
    # Compute trigrams
    tuple_trigrams = list(nltk.trigrams(unigrams))
    string_trigrams = convert_tuple_trigrams(tuple_trigrams)
    trigramCount = convert_strings_to_counts(string_trigrams)
    
    # Compute fulltext
    with open(f'../data/texts/{text}', 'r') as file:
        fullText = file.read()
    
    # Calculate wordCount
    wordCount = sum(unigramCount.values())
  
    # Create a dictionary `data` to hold each document's data
    # Including id, wordCount, outputFormat, unigramCount,
    # bigramCount, trigramCount, fullText, etc.
    data = {}
    
    data.update([
        ('id', identifier),
        ('title', identifier),
        ('outputFormat', ['unigram', 'bigram', 'trigram', 'fullText']),
        ('wordCount', wordCount),
        ('fullText', fullText),
        ('unigramCount', unigramCount), 
        ('bigramCount', bigramCount), 
        ('trigramCount', trigramCount)
    ])
    
    # Add additional metadata if there is a metadata.csv available
    if csv_exists == True:
        # Read in the CSV file and set the index
        df = pd.read_csv(f'./data/{csv_filename}')
        df.set_index('id', inplace=True)
        # Update Metadata
        update_metadata_from_csv()
        
    
    # Write the document to the json file  
    with open(f'../data/{output_filename}', 'a') as outfile:
        json.dump(data, outfile)
        outfile.write('\n')
        print(f'Text {text} written to json-l file.')

print('\n' + str(len(text_list)) + f' texts written to {output_filename}.')
Text a-midsummer-nights-dream_TXT_FolgerShakespeare.txt written to json-l file.
Text alls-well-that-ends-well_TXT_FolgerShakespeare.txt written to json-l file.
Text antony-and-cleopatra_TXT_FolgerShakespeare.txt written to json-l file.
Text as-you-like-it_TXT_FolgerShakespeare.txt written to json-l file.
Text coriolanus_TXT_FolgerShakespeare.txt written to json-l file.
Text cymbeline_TXT_FolgerShakespeare.txt written to json-l file.
Text hamlet_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-iv-part-1_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-iv-part-2_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-v_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-1_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-2_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-3_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-viii_TXT_FolgerShakespeare.txt written to json-l file.
Text julius-caesar_TXT_FolgerShakespeare.txt written to json-l file.
Text king-john_TXT_FolgerShakespeare.txt written to json-l file.
Text king-lear_TXT_FolgerShakespeare.txt written to json-l file.
Text loves-labors-lost_TXT_FolgerShakespeare.txt written to json-l file.
Text lucrece_TXT_FolgerShakespeare.txt written to json-l file.
Text macbeth_TXT_FolgerShakespeare.txt written to json-l file.
Text measure-for-measure_TXT_FolgerShakespeare.txt written to json-l file.
Text much-ado-about-nothing_TXT_FolgerShakespeare.txt written to json-l file.
Text othello_TXT_FolgerShakespeare.txt written to json-l file.
Text pericles_TXT_FolgerShakespeare.txt written to json-l file.
Text richard-ii_TXT_FolgerShakespeare.txt written to json-l file.
Text richard-iii_TXT_FolgerShakespeare.txt written to json-l file.
Text romeo-and-juliet_TXT_FolgerShakespeare.txt written to json-l file.
Text shakespeares-sonnets_TXT_FolgerShakespeare.txt written to json-l file.
Text the-comedy-of-errors_TXT_FolgerShakespeare.txt written to json-l file.
Text the-merchant-of-venice_TXT_FolgerShakespeare.txt written to json-l file.
Text the-merry-wives-of-windsor_TXT_FolgerShakespeare.txt written to json-l file.
Text the-phoenix-and-turtle_TXT_FolgerShakespeare.txt written to json-l file.
Text the-taming-of-the-shrew_TXT_FolgerShakespeare.txt written to json-l file.
Text the-tempest_TXT_FolgerShakespeare.txt written to json-l file.
Text the-two-gentlemen-of-verona_TXT_FolgerShakespeare.txt written to json-l file.
Text the-two-noble-kinsmen_TXT_FolgerShakespeare.txt written to json-l file.
Text the-winters-tale_TXT_FolgerShakespeare.txt written to json-l file.
Text timon-of-athens_TXT_FolgerShakespeare.txt written to json-l file.
Text titus-andronicus_TXT_FolgerShakespeare.txt written to json-l file.
Text troilus-and-cressida_TXT_FolgerShakespeare.txt written to json-l file.
Text twelfth-night_TXT_FolgerShakespeare.txt written to json-l file.
Text venus-and-adonis_TXT_FolgerShakespeare.txt written to json-l file.

42 texts written to my_data.jsonl.

Gzip the JSONL file#

# Gzip dataset

with open(f'../data/{output_filename}', 'rb') as f_in:
    with gzip.open(f'../data/{output_filename}.gz', 'wb') as f_out:
        f_out.writelines(f_in)

print(f'Compression complete. \n{output_filename}.gz has been created.')
Compression complete. 
my_data.jsonl.gz has been created.
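
To verify the compressed dataset, you could read it back line by line (a sketch; gzip and json were imported earlier):

# Read the gzipped JSONL back, printing each document's id and wordCount
with gzip.open(f'../data/{output_filename}.gz', 'rt') as f:
    for line in f:
        doc = json.loads(line)
        print(doc['id'], doc['wordCount'])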

Note: The Constellate Lab saves Jupyter Notebooks but not dataset files. Be sure to save your dataset to your local machine or cloud storage.