Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Tokenize Text Files with NLTK#

Description: This notebook takes as input:

  • Plain text files (.txt) in a zipped folder called ‘texts’ in the data folder

  • Metadata CSV file called ‘metadata.csv’ in the data folder (optional)

and outputs a single JSONL file containing the unigrams, bigrams, trigrams, full text, and metadata for each text.

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Advanced

Completion time: 10-15 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: .txt, .csv, .jsonl

Libraries Used:

  • os

  • json

  • NLTK

  • gzip

  • nltk.corpus

  • collections

  • pandas

Research Pipeline:

  1. Scan documents

  2. OCR files

  3. Clean up texts

  4. Tokenize text files (this notebook)


Data Inputs#

Texts (.txt)#

All the texts should be in plaintext format. The filenames may be used for reference, so give them descriptive names that will help you identify them for your analysis. Additional data about each text can be supplied in an optional CSV file described below.

Place them in a folder called ‘texts’, then zip that folder into a single file called ‘texts.zip’ and put it in the data folder.
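
If you prefer to create the zip file programmatically, here is a minimal sketch using Python’s standard library. It assumes your .txt files are already in ../data/texts and, like the sample data used below, places the text files at the top level of the archive:

import shutil

# Zip the contents of ../data/texts into ../data/texts.zip
# so the .txt files sit at the top level of the archive,
# matching the layout of the sample data downloaded below
shutil.make_archive('../data/texts', 'zip', root_dir='../data/texts')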

Metadata (.csv) (Optional)#

A CSV file containing metadata may be included for analysis. For specifications, see the list of fields below.

The fields may include the following:

  • id: a unique item ID (in JSTOR, this is a stable URL)

  • title: the title for the document

  • isPartOf: the larger work that holds this title (for example, a journal title)

  • publicationYear: the year of publication

  • doi: the digital object identifier

  • docType: the type of document (for example, article or book)

  • provider: the source or provider of the dataset

  • datePublished: the publication date in yyyy-mm-dd format

  • issueNumber: the issue number for a journal publication

  • volumeNumber: the volume number for a journal publication

  • url: a URL for the item and/or the item’s metadata

  • creator: the author or authors of the item

  • language: the language or languages of the item (eng is the ISO 639 code for English)

  • pageStart: the first page number of the print version

  • pageEnd: the last page number of the print version

  • placeOfPublication: the city of the publisher

  • pageCount: the number of print pages in the item

  • wordCount: the number of words in the item

  • pagination: the page sequence in the print version

  • publisher: the publisher for the item

  • abstract: the abstract description for the document

  • outputFormat: what data is available (unigrams, bigrams, trigrams, and/or full-text)
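
If you build your own metadata.csv, each id must match the identifier this notebook derives from the filename (the filename minus its .txt extension). Here is a minimal sketch using pandas; the titles and years are illustrative placeholders. Note that update_metadata_from_csv() below reads every field it lists, so either include all of those columns or comment out the unused fields in that function.

import pandas as pd

# Hypothetical rows; 'id' must equal the text's filename minus '.txt'
metadata = pd.DataFrame([
    {'id': 'hamlet_TXT_FolgerShakespeare', 'title': 'Hamlet',
     'publicationYear': 1603, 'docType': 'book'},
    {'id': 'macbeth_TXT_FolgerShakespeare', 'title': 'Macbeth',
     'publicationYear': 1623, 'docType': 'book'},
])
metadata.to_csv('../data/metadata.csv', index=False)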

# Download sample data
# Shakespeare's Plays from The Folger Shakespeare
# https://shakespeare.folger.edu/download-the-folger-shakespeare-complete-set/
import urllib.request
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

zipfile_address = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/data/texts.zip'
urllib.request.urlretrieve(zipfile_address, '../data/texts.zip')
('../data/texts.zip', <http.client.HTTPMessage at 0x104323590>)

Import Libraries#

import os
import json
import gzip
import zipfile
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import PlaintextCorpusReader

Define Functions#

### Various functions written for this notebook ###

def convert_tuple_bigrams(tuples_to_convert):
    """Converts NLTK tuples into bigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        gram_string = f'{first_word} {second_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_tuple_trigrams(tuples_to_convert):
    """Converts NLTK tuples into trigram strings"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        third_word = tuple_grams[2]
        gram_string = f'{first_word} {second_word} {third_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_strings_to_counts(string_grams):
    """Converts a Counter of n-grams into a dictionary"""
    counter_of_grams = Counter(string_grams)
    dict_of_grams = dict(counter_of_grams)
    return dict_of_grams

def update_metadata_from_csv():
    """Uses pandas to grab additional metadata fields from a CSV file then adds them to the JSON-L file.
    Unused fields can be commented out."""
    title = df.loc[identifier, 'title']
    isPartOf = df.loc[identifier, 'isPartOf']
    publicationYear = str(df.loc[identifier, 'publicationYear'])
    doi = df.loc[identifier, 'doi']
    docType = df.loc[identifier, 'docType']
    provider = df.loc[identifier, 'provider']
    datePublished = df.loc[identifier, 'datePublished']
    issueNumber = str(df.loc[identifier, 'issueNumber'])
    volumeNumber = str(df.loc[identifier, 'volumeNumber'])
    url = df.loc[identifier, 'url']
    creator = df.loc[identifier, 'creator']
    publisher = df.loc[identifier, 'publisher']
    language = df.loc[identifier, 'language']
    pageStart = df.loc[identifier, 'pageStart']
    pageEnd = df.loc[identifier, 'pageEnd']
    placeOfPublication = df.loc[identifier, 'placeOfPublication']
    pageCount = str(df.loc[identifier, 'pageCount'])

    data.update([   
        ('title', title),
        ('isPartOf', isPartOf),
        ('publicationYear', publicationYear),
        ('doi', doi),
        ('docType', docType),
        ('provider', provider),
        ('datePublished', datePublished),
        ('issueNumber', issueNumber),
        ('volumeNumber', volumeNumber),
        ('url', url),
        ('creator', creator),
        ('publisher', publisher),
        ('language', language),
        ('pageStart', pageStart),
        ('pageEnd', pageEnd),
        ('placeOfPublication', placeOfPublication),
        ('pageCount', pageCount),
    ])
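
A quick sketch of what the n-gram helpers above produce, using a made-up word list (expected output shown in the comments):

sample_words = ['to', 'be', 'or', 'not', 'to', 'be']

# Pair up adjacent words, then join each pair into a string
string_bigrams = convert_tuple_bigrams(nltk.bigrams(sample_words))
print(string_bigrams)
# ['to be', 'be or', 'or not', 'not to', 'to be']

# Tally how often each bigram string occurs
print(convert_strings_to_counts(string_bigrams))
# {'to be': 2, 'be or': 1, 'or not': 1, 'not to': 1}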

Unzip Texts Folder (optional)#

### Extract Zip File of Texts ###
# The zip file should extract into a folder
# called 'texts'

# Alternatively, skip this unzipping code cell:
# create a folder called 'texts' in the 'data' folder
# and place your .txt files inside it

filename = '../data/texts.zip'

try:
    corpus_zip = zipfile.ZipFile(filename)
    corpus_zip.extractall('../data/texts/')
    corpus_zip.close()
    print('Zip file extracted successfully.')
except FileNotFoundError:
    print('No zip file detected. Upload your zip file to the data folder.')
Zip file extracted successfully.

Check for Metadata CSV (optional)#

### Check for a metadata CSV file ###

csv_filename = 'metadata.csv'

if os.path.exists(f'../data/{csv_filename}'):
    csv_exists = True
    print('Metadata CSV found.')
else: 
    csv_exists = False
    print('No metadata CSV found.')
No metadata CSV found.

Import the Text Files into NLTK#

### Establish root folder holding all text files ###
# Create corpus using all text files in corpus_root
# By default, PlaintextCorpusReader uses WordPunctTokenizer for words
# and the punkt tokenizer for sentences
# See https://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
from nltk.corpus import PlaintextCorpusReader

# Alternative word tokenizers that could be passed in instead
from nltk.tokenize import (TreebankWordTokenizer,
                           word_tokenize,
                           WordPunctTokenizer,
                           TweetTokenizer,
                           MWETokenizer)

corpus_root = '../data/texts'
corpus = PlaintextCorpusReader(corpus_root, '.*txt', word_tokenizer=WordPunctTokenizer())
### Print all File IDs in corpus based on text file names ###
text_list = corpus.fileids()
print('Corpus created from:')
list(text_list)
Corpus created from:
['a-midsummer-nights-dream_TXT_FolgerShakespeare.txt',
 'alls-well-that-ends-well_TXT_FolgerShakespeare.txt',
 'antony-and-cleopatra_TXT_FolgerShakespeare.txt',
 'as-you-like-it_TXT_FolgerShakespeare.txt',
 'coriolanus_TXT_FolgerShakespeare.txt',
 'cymbeline_TXT_FolgerShakespeare.txt',
 'hamlet_TXT_FolgerShakespeare.txt',
 'henry-iv-part-1_TXT_FolgerShakespeare.txt',
 'henry-iv-part-2_TXT_FolgerShakespeare.txt',
 'henry-v_TXT_FolgerShakespeare.txt',
 'henry-vi-part-1_TXT_FolgerShakespeare.txt',
 'henry-vi-part-2_TXT_FolgerShakespeare.txt',
 'henry-vi-part-3_TXT_FolgerShakespeare.txt',
 'henry-viii_TXT_FolgerShakespeare.txt',
 'julius-caesar_TXT_FolgerShakespeare.txt',
 'king-john_TXT_FolgerShakespeare.txt',
 'king-lear_TXT_FolgerShakespeare.txt',
 'loves-labors-lost_TXT_FolgerShakespeare.txt',
 'lucrece_TXT_FolgerShakespeare.txt',
 'macbeth_TXT_FolgerShakespeare.txt',
 'measure-for-measure_TXT_FolgerShakespeare.txt',
 'much-ado-about-nothing_TXT_FolgerShakespeare.txt',
 'othello_TXT_FolgerShakespeare.txt',
 'pericles_TXT_FolgerShakespeare.txt',
 'richard-ii_TXT_FolgerShakespeare.txt',
 'richard-iii_TXT_FolgerShakespeare.txt',
 'romeo-and-juliet_TXT_FolgerShakespeare.txt',
 'shakespeares-sonnets_TXT_FolgerShakespeare.txt',
 'the-comedy-of-errors_TXT_FolgerShakespeare.txt',
 'the-merchant-of-venice_TXT_FolgerShakespeare.txt',
 'the-merry-wives-of-windsor_TXT_FolgerShakespeare.txt',
 'the-phoenix-and-turtle_TXT_FolgerShakespeare.txt',
 'the-taming-of-the-shrew_TXT_FolgerShakespeare.txt',
 'the-tempest_TXT_FolgerShakespeare.txt',
 'the-two-gentlemen-of-verona_TXT_FolgerShakespeare.txt',
 'the-two-noble-kinsmen_TXT_FolgerShakespeare.txt',
 'the-winters-tale_TXT_FolgerShakespeare.txt',
 'timon-of-athens_TXT_FolgerShakespeare.txt',
 'titus-andronicus_TXT_FolgerShakespeare.txt',
 'troilus-and-cressida_TXT_FolgerShakespeare.txt',
 'twelfth-night_TXT_FolgerShakespeare.txt',
 'venus-and-adonis_TXT_FolgerShakespeare.txt']
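
The choice of word tokenizer changes the tokens, and therefore every n-gram count downstream. A quick sketch comparing two of the tokenizers imported above on a line with a contraction (expected output in comments):

sample = "What's in a name?"

# WordPunctTokenizer splits on every punctuation boundary
print(WordPunctTokenizer().tokenize(sample))
# ['What', "'", 's', 'in', 'a', 'name', '?']

# TreebankWordTokenizer keeps contractions as clitics like "'s"
print(TreebankWordTokenizer().tokenize(sample))
# ['What', "'s", 'in', 'a', 'name', '?']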

Generate and Output Data to JSONL File#

If an old JSONL file already exists, this process will overwrite it.

For each text, this code will:

  1. Gather unigrams, bigrams, trigrams, and full text

  2. Compute word counts

  3. Check for additional metadata in a CSV file

  4. Write the data to the JSONL file (a sketch of one record’s shape follows this list)
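
For reference, each line of the output file is a single JSON object shaped roughly like this (a sketch with illustrative values, not real data):

# Shape of one JSONL record (illustrative values only)
record = {
    'id': 'hamlet_TXT_FolgerShakespeare',
    'title': 'hamlet_TXT_FolgerShakespeare',
    'outputFormat': ['unigram', 'bigram', 'trigram', 'fullText'],
    'wordCount': 6,
    'fullText': 'To be or not to be',
    'unigramCount': {'To': 1, 'be': 2, 'or': 1, 'not': 1, 'to': 1},
    'bigramCount': {'To be': 1, 'be or': 1, 'or not': 1, 'not to': 1, 'to be': 1},
    'trigramCount': {'To be or': 1, 'be or not': 1, 'or not to': 1, 'not to be': 1}
}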

### Create the JSON-L file and gzip it ###

# For every text: 
# 1. Compute unigrams, bigrams, trigrams, and wordCount
# 2. Append the data to a JSON-L file
# After all data is written, compress the dataset using gzip
## **If the JSONL file exists, it will be overwritten**

# Define the file output name
output_filename = 'my_data.jsonl'

# Delete output files if they already exist
if os.path.exists(f'../data/{output_filename}'):
    os.remove(f'../data/{output_filename}')
    print(f'Overwriting old version of {output_filename}')

if os.path.exists(f'../data/{output_filename}.gz'):
    os.remove(f'../data/{output_filename}.gz')
    print(f'Overwriting old version of {output_filename}.gz\n')
                  

for text in text_list:
    
    # Create identifier from filename
    identifier = text[:-4]
    
    # Compute unigrams
    unigrams = corpus.words(text)
    unigramCount = convert_strings_to_counts(unigrams)
    
    # Compute bigrams
    tuple_bigrams = list(nltk.bigrams(unigrams))
    string_bigrams = convert_tuple_bigrams(tuple_bigrams)
    bigramCount = convert_strings_to_counts(string_bigrams)
    
    # Compute trigrams
    tuple_trigrams = list(nltk.trigrams(unigrams))
    string_trigrams = convert_tuple_trigrams(tuple_trigrams)
    trigramCount = convert_strings_to_counts(string_trigrams)
    
    # Compute fulltext
    with open(f'../data/texts/{text}', 'r') as file:
        fullText = file.read()
    
    # Calculate wordCount
    wordCount = sum(unigramCount.values())
  
    # Create a dictionary `data` to hold each document's data
    # Including id, wordCount, outputFormat, unigramCount,
    # bigramCount, trigramCount, fullText, etc.
    data = {}
    
    data.update([
        ('id', identifier),
        ('title', identifier),
        ('outputFormat', ['unigram', 'bigram', 'trigram', 'fullText']),
        ('wordCount', wordCount),
        ('fullText', fullText),
        ('unigramCount', unigramCount), 
        ('bigramCount', bigramCount), 
        ('trigramCount', trigramCount)
    ])
    
    # Add additional metadata if there is a metadata.csv available
    if csv_exists == True:
        # Read in the CSV file and set the index
        df = pd.read_csv(f'./data/{csv_filename}')
        df.set_index('id', inplace=True)
        # Update Metadata
        update_metadata_from_csv()
        
    
    # Write the document to the json file  
    with open(f'../data/{output_filename}', 'a') as outfile:
        json.dump(data, outfile)
        outfile.write('\n')
        print(f'Text {text} written to json-l file.')

print('\n' + str(len(text_list)) + f' texts written to {output_filename}.')
Text a-midsummer-nights-dream_TXT_FolgerShakespeare.txt written to json-l file.
Text alls-well-that-ends-well_TXT_FolgerShakespeare.txt written to json-l file.
Text antony-and-cleopatra_TXT_FolgerShakespeare.txt written to json-l file.
Text as-you-like-it_TXT_FolgerShakespeare.txt written to json-l file.
Text coriolanus_TXT_FolgerShakespeare.txt written to json-l file.
Text cymbeline_TXT_FolgerShakespeare.txt written to json-l file.
Text hamlet_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-iv-part-1_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-iv-part-2_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-v_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-1_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-2_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-3_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-viii_TXT_FolgerShakespeare.txt written to json-l file.
Text julius-caesar_TXT_FolgerShakespeare.txt written to json-l file.
Text king-john_TXT_FolgerShakespeare.txt written to json-l file.
Text king-lear_TXT_FolgerShakespeare.txt written to json-l file.
Text loves-labors-lost_TXT_FolgerShakespeare.txt written to json-l file.
Text lucrece_TXT_FolgerShakespeare.txt written to json-l file.
Text macbeth_TXT_FolgerShakespeare.txt written to json-l file.
Text measure-for-measure_TXT_FolgerShakespeare.txt written to json-l file.
Text much-ado-about-nothing_TXT_FolgerShakespeare.txt written to json-l file.
Text othello_TXT_FolgerShakespeare.txt written to json-l file.
Text pericles_TXT_FolgerShakespeare.txt written to json-l file.
Text richard-ii_TXT_FolgerShakespeare.txt written to json-l file.
Text richard-iii_TXT_FolgerShakespeare.txt written to json-l file.
Text romeo-and-juliet_TXT_FolgerShakespeare.txt written to json-l file.
Text shakespeares-sonnets_TXT_FolgerShakespeare.txt written to json-l file.
Text the-comedy-of-errors_TXT_FolgerShakespeare.txt written to json-l file.
Text the-merchant-of-venice_TXT_FolgerShakespeare.txt written to json-l file.
Text the-merry-wives-of-windsor_TXT_FolgerShakespeare.txt written to json-l file.
Text the-phoenix-and-turtle_TXT_FolgerShakespeare.txt written to json-l file.
Text the-taming-of-the-shrew_TXT_FolgerShakespeare.txt written to json-l file.
Text the-tempest_TXT_FolgerShakespeare.txt written to json-l file.
Text the-two-gentlemen-of-verona_TXT_FolgerShakespeare.txt written to json-l file.
Text the-two-noble-kinsmen_TXT_FolgerShakespeare.txt written to json-l file.
Text the-winters-tale_TXT_FolgerShakespeare.txt written to json-l file.
Text timon-of-athens_TXT_FolgerShakespeare.txt written to json-l file.
Text titus-andronicus_TXT_FolgerShakespeare.txt written to json-l file.
Text troilus-and-cressida_TXT_FolgerShakespeare.txt written to json-l file.
Text twelfth-night_TXT_FolgerShakespeare.txt written to json-l file.
Text venus-and-adonis_TXT_FolgerShakespeare.txt written to json-l file.

42 texts written to my_data.jsonl.

Gzip the JSONL file#

# Gzip dataset

with open(f'../data/{output_filename}', 'rb') as f_in:
    with gzip.open(f'../data/{output_filename}.gz', 'wb') as f_out:
        f_out.writelines(f_in)

print(f'Compression complete. \n{output_filename}.gz has been created.')
Compression complete. 
my_data.jsonl.gz has been created.
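
To verify the compressed dataset, you could read it back line by line (a sketch; gzip and json were imported earlier):

# Read the gzipped JSONL back, printing each document's id and wordCount
with gzip.open(f'../data/{output_filename}.gz', 'rt') as f:
    for line in f:
        doc = json.loads(line)
        print(doc['id'], doc['wordCount'])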

Note: The Constellate Lab saves Jupyter Notebooks but not dataset files. Be sure to save your dataset to your local machine or cloud storage.