Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.
Tokenize Text Files with NLTK#
Description: This notebook takes as input:
Plain text files (.txt) in a zipped folder called ‘texts’ in the data folder
Metadata CSV file called ‘metadata.csv’ in the data folder (optional)
and outputs a single JSON-L file containing the unigrams, bigrams, trigrams, full-text, and metadata.
Use Case: For Researchers (Mostly code without explanation, not ideal for learners)
Difficulty: Advanced
Completion time: 10-15 minutes
Knowledge Required:
Python Basics (Start Python Basics 1)
Knowledge Recommended:
Data Format: .txt, .csv, .jsonl
Libraries Used:
os
json
zipfile
gzip
urllib.request
pathlib
nltk (including nltk.corpus)
collections
pandas
Research Pipeline:
Scan documents
OCR files
Clean up texts
Tokenize text files (this notebook)
Data Inputs#
Texts (.txt)#
All the texts should be in plaintext format. The filenames may be used for reference, so give them descriptive names that will help you identify them for your analysis. Additional data about each text can be supplied in an optional CSV file described below.
Place them in a folder called ‘texts’, then zip that folder into a single file called ‘texts.zip’ and upload it to the data folder. A sketch for creating the zip file with Python follows below.
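If you would rather create ‘texts.zip’ with Python than by hand, here is a minimal sketch using the standard library. It assumes your .txt files are already gathered in a local ‘../data/texts’ folder and zips their contents at the top level of the archive:

import shutil

# Zip the contents of '../data/texts' into '../data/texts.zip'
# (an assumption: your .txt files already sit in that folder)
shutil.make_archive('../data/texts', 'zip', root_dir='../data/texts')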
Metadata (.csv) (Optional)#
A CSV file containing metadata may be included for analysis. For specifications, see the table below.
The fields may include the following:
| Column Name | Description |
|---|---|
| id | a unique item ID (in JSTOR, this is a stable URL) |
| title | the title for the document |
| isPartOf | the larger work that holds this title (for example, a journal title) |
| publicationYear | the year of publication |
| doi | the digital object identifier |
| docType | the type of document (for example, article or book) |
| provider | the source or provider of the dataset |
| datePublished | the publication date in yyyy-mm-dd format |
| issueNumber | the issue number for a journal publication |
| volumeNumber | the volume number for a journal publication |
| url | a URL for the item and/or the item’s metadata |
| creator | the author or authors of the item |
| language | the language or languages of the item (eng is the ISO 639 code for English) |
| pageStart | the first page number of the print version |
| pageEnd | the last page number of the print version |
| placeOfPublication | the city of the publisher |
| pageCount | the number of print pages in the item |
| wordCount | the number of words in the item |
| pagination | the page sequence in the print version |
| publisher | the publisher for the item |
| abstract | the abstract description for the document |
| outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
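If you are supplying your own metadata, one way to produce a compatible CSV is with pandas. The rows below are purely hypothetical; each id must match a text’s filename minus the .txt extension, and any columns you leave out should be commented out of update_metadata_from_csv() further down:

import pandas as pd

# Hypothetical example rows; the `id` values must match each text's
# filename without the .txt extension. Columns omitted here should be
# commented out of update_metadata_from_csv() below.
metadata = pd.DataFrame([
    {'id': 'hamlet_TXT_FolgerShakespeare',
     'title': 'Hamlet',
     'creator': 'William Shakespeare',
     'publicationYear': 1603,
     'language': 'eng'},
])
metadata.to_csv('../data/metadata.csv', index=False)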
# Download sample data
# Shakespeare's plays from The Folger Shakespeare
# https://shakespeare.folger.edu/download-the-folger-shakespeare-complete-set/
import urllib.request
from pathlib import Path
# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)
zipfile_address = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/data/texts.zip'
urllib.request.urlretrieve(zipfile_address, '../data/texts.zip')
('../data/texts.zip', <http.client.HTTPMessage at 0x104323590>)
Import Libraries#
import zipfile, os, nltk, json, gzip
import pandas as pd
from nltk.corpus import PlaintextCorpusReader
from collections import Counter
Define Functions#
### Various functions written for this notebook ###

def convert_tuple_bigrams(tuples_to_convert):
    """Converts NLTK bigram tuples into strings of two words"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        gram_string = f'{first_word} {second_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_tuple_trigrams(tuples_to_convert):
    """Converts NLTK trigram tuples into strings of three words"""
    string_grams = []
    for tuple_grams in tuples_to_convert:
        first_word = tuple_grams[0]
        second_word = tuple_grams[1]
        third_word = tuple_grams[2]
        gram_string = f'{first_word} {second_word} {third_word}'
        string_grams.append(gram_string)
    return string_grams

def convert_strings_to_counts(string_grams):
    """Counts the n-gram strings and returns a dictionary
    mapping each n-gram to its frequency"""
    counter_of_grams = Counter(string_grams)
    dict_of_grams = dict(counter_of_grams)
    return dict_of_grams
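As a quick sanity check, the helpers above can be run on a tiny token list to show the shape of the data they produce:

# Demonstrate the n-gram helpers on a small token list
tokens = ['to', 'be', 'or', 'not', 'to', 'be']
bigram_strings = convert_tuple_bigrams(nltk.bigrams(tokens))
print(convert_strings_to_counts(bigram_strings))
# {'to be': 2, 'be or': 1, 'or not': 1, 'not to': 1}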
def update_metadata_from_csv():
    """Uses pandas to grab additional metadata fields from a CSV file,
    then adds them to the record written to the JSON-L file. Relies on
    the global variables `df`, `identifier`, and `data` defined in the
    main loop below. Unused fields can be commented out."""
    title = df.loc[identifier, 'title']
    isPartOf = df.loc[identifier, 'isPartOf']
    publicationYear = str(df.loc[identifier, 'publicationYear'])
    doi = df.loc[identifier, 'doi']
    docType = df.loc[identifier, 'docType']
    provider = df.loc[identifier, 'provider']
    datePublished = df.loc[identifier, 'datePublished']
    issueNumber = str(df.loc[identifier, 'issueNumber'])
    volumeNumber = str(df.loc[identifier, 'volumeNumber'])
    url = df.loc[identifier, 'url']
    creator = df.loc[identifier, 'creator']
    publisher = df.loc[identifier, 'publisher']
    language = df.loc[identifier, 'language']
    pageStart = df.loc[identifier, 'pageStart']
    pageEnd = df.loc[identifier, 'pageEnd']
    placeOfPublication = df.loc[identifier, 'placeOfPublication']
    pageCount = str(df.loc[identifier, 'pageCount'])
    data.update([
        ('title', title),
        ('isPartOf', isPartOf),
        ('publicationYear', publicationYear),
        ('doi', doi),
        ('docType', docType),
        ('provider', provider),
        ('datePublished', datePublished),
        ('issueNumber', issueNumber),
        ('volumeNumber', volumeNumber),
        ('url', url),
        ('creator', creator),
        ('publisher', publisher),
        ('language', language),
        ('pageStart', pageStart),
        ('pageEnd', pageEnd),
        ('placeOfPublication', placeOfPublication),
        ('pageCount', pageCount),
    ])
Unzip Texts Folder (optional)#
### Extract Zip File of Texts ###
# The zip file should extract into a folder
# called 'texts' inside the 'data' folder.
# Alternatively, skip this unzipping code cell
# and create the 'texts' folder yourself.

filename = '../data/texts.zip'

try:
    with zipfile.ZipFile(filename) as corpus_zip:
        corpus_zip.extractall('../data/texts/')
    print('Zip file extracted successfully.')
except FileNotFoundError:
    print('No zip file detected. Upload your zip file to the data folder.')
Zip file extracted successfully.
Check for Metadata CSV (optional)#
### Check for a metadata CSV file ###

csv_filename = 'metadata.csv'

if os.path.exists(f'../data/{csv_filename}'):
    csv_exists = True
    print('Metadata CSV found.')
else:
    csv_exists = False
    print('No metadata CSV found.')
No metadata CSV found.
Import the Text Files into NLTK#
### Establish root folder holding all text files ###
# Create a corpus using all text files in corpus_root.
# PlaintextCorpusReader tokenizes words with WordPunctTokenizer
# by default (and sentences with the Punkt tokenizer); we pass
# WordPunctTokenizer explicitly here.
# See https://www.nltk.org/_modules/nltk/corpus/reader/plaintext.html
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import (TreebankWordTokenizer,
word_tokenize,
WordPunctTokenizer,
TweetTokenizer,
MWETokenizer)
corpus_root = '../data/texts'
corpus = PlaintextCorpusReader(corpus_root, '.*txt', word_tokenizer=WordPunctTokenizer())
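The import above also brings in several alternative word tokenizers. If you want to compare them before committing to WordPunctTokenizer, here is a quick sketch (depending on your NLTK version, word_tokenize needs the punkt or punkt_tab model, downloadable with nltk.download):

# Compare two word tokenizers on the same sentence
sample = "Don't tokenize me, bro!"
print(WordPunctTokenizer().tokenize(sample))
# ['Don', "'", 't', 'tokenize', 'me', ',', 'bro', '!']
nltk.download('punkt', quiet=True)  # model required by word_tokenize
print(word_tokenize(sample))
# ['Do', "n't", 'tokenize', 'me', ',', 'bro', '!']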
### Print all File IDs in corpus based on text file names ###
text_list = corpus.fileids()
print('Corpus created from:')
list(text_list)
Corpus created from:
['a-midsummer-nights-dream_TXT_FolgerShakespeare.txt',
'alls-well-that-ends-well_TXT_FolgerShakespeare.txt',
'antony-and-cleopatra_TXT_FolgerShakespeare.txt',
'as-you-like-it_TXT_FolgerShakespeare.txt',
'coriolanus_TXT_FolgerShakespeare.txt',
'cymbeline_TXT_FolgerShakespeare.txt',
'hamlet_TXT_FolgerShakespeare.txt',
'henry-iv-part-1_TXT_FolgerShakespeare.txt',
'henry-iv-part-2_TXT_FolgerShakespeare.txt',
'henry-v_TXT_FolgerShakespeare.txt',
'henry-vi-part-1_TXT_FolgerShakespeare.txt',
'henry-vi-part-2_TXT_FolgerShakespeare.txt',
'henry-vi-part-3_TXT_FolgerShakespeare.txt',
'henry-viii_TXT_FolgerShakespeare.txt',
'julius-caesar_TXT_FolgerShakespeare.txt',
'king-john_TXT_FolgerShakespeare.txt',
'king-lear_TXT_FolgerShakespeare.txt',
'loves-labors-lost_TXT_FolgerShakespeare.txt',
'lucrece_TXT_FolgerShakespeare.txt',
'macbeth_TXT_FolgerShakespeare.txt',
'measure-for-measure_TXT_FolgerShakespeare.txt',
'much-ado-about-nothing_TXT_FolgerShakespeare.txt',
'othello_TXT_FolgerShakespeare.txt',
'pericles_TXT_FolgerShakespeare.txt',
'richard-ii_TXT_FolgerShakespeare.txt',
'richard-iii_TXT_FolgerShakespeare.txt',
'romeo-and-juliet_TXT_FolgerShakespeare.txt',
'shakespeares-sonnets_TXT_FolgerShakespeare.txt',
'the-comedy-of-errors_TXT_FolgerShakespeare.txt',
'the-merchant-of-venice_TXT_FolgerShakespeare.txt',
'the-merry-wives-of-windsor_TXT_FolgerShakespeare.txt',
'the-phoenix-and-turtle_TXT_FolgerShakespeare.txt',
'the-taming-of-the-shrew_TXT_FolgerShakespeare.txt',
'the-tempest_TXT_FolgerShakespeare.txt',
'the-two-gentlemen-of-verona_TXT_FolgerShakespeare.txt',
'the-two-noble-kinsmen_TXT_FolgerShakespeare.txt',
'the-winters-tale_TXT_FolgerShakespeare.txt',
'timon-of-athens_TXT_FolgerShakespeare.txt',
'titus-andronicus_TXT_FolgerShakespeare.txt',
'troilus-and-cressida_TXT_FolgerShakespeare.txt',
'twelfth-night_TXT_FolgerShakespeare.txt',
'venus-and-adonis_TXT_FolgerShakespeare.txt']
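Before generating the full dataset, you can peek at the tokens of any single file in the corpus:

# Inspect the first few word tokens of one text
sample_tokens = corpus.words('hamlet_TXT_FolgerShakespeare.txt')
print(sample_tokens[:10])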
Generate and Output Data to JSON-L File#
If an old JSON-L file already exists, this process will overwrite it.
For each text, this code will:
Gather unigrams, bigrams, trigrams, and full text
Compute word counts
Check for additional metadata in a CSV file
Write the data to the JSON-L file
### Create the JSON-L file and gzip it ###
# For every text:
# 1. Compute unigrams, bigrams, trigrams, and wordCount
# 2. Append the data to a JSON-L file
# After all data is written, compress the dataset using gzip
## **If the JSONL file exists, it will be overwritten**
# Define the file output name
output_filename = 'my_data.jsonl'
# Delete output files if they already exist
if os.path.exists(f'../data/{output_filename}'):
    os.remove(f'../data/{output_filename}')
    print(f'Overwriting old version of {output_filename}')

if os.path.exists(f'../data/{output_filename}.gz'):
    os.remove(f'../data/{output_filename}.gz')
    print(f'Overwriting old version of {output_filename}.gz\n')
for text in text_list:
    # Create an identifier from the filename (strip the .txt extension)
    identifier = text[:-4]
    # Compute unigrams
    unigrams = corpus.words(text)
    unigramCount = convert_strings_to_counts(unigrams)
    # Compute bigrams
    tuple_bigrams = list(nltk.bigrams(unigrams))
    string_bigrams = convert_tuple_bigrams(tuple_bigrams)
    bigramCount = convert_strings_to_counts(string_bigrams)
    # Compute trigrams
    tuple_trigrams = list(nltk.trigrams(unigrams))
    string_trigrams = convert_tuple_trigrams(tuple_trigrams)
    trigramCount = convert_strings_to_counts(string_trigrams)
    # Read the full text
    with open(f'../data/texts/{text}', 'r') as file:
        fullText = file.read()
    # Calculate wordCount by summing the unigram counts
    wordCount = sum(unigramCount.values())
    # Create a dictionary `data` to hold each document's data,
    # including id, wordCount, outputFormat, unigramCount,
    # bigramCount, trigramCount, and fullText
    data = {}
    data.update([
        ('id', identifier),
        ('title', identifier),
        ('outputFormat', ['unigram', 'bigram', 'trigram', 'fullText']),
        ('wordCount', wordCount),
        ('fullText', fullText),
        ('unigramCount', unigramCount),
        ('bigramCount', bigramCount),
        ('trigramCount', trigramCount)
    ])
    # Add additional metadata if a metadata.csv is available
    if csv_exists:
        # Read in the CSV file and set the index
        df = pd.read_csv(f'../data/{csv_filename}')
        df.set_index('id', inplace=True)
        # Update the metadata
        update_metadata_from_csv()
    # Append the document to the JSON-L file
    with open(f'../data/{output_filename}', 'a') as outfile:
        json.dump(data, outfile)
        outfile.write('\n')
    print(f'Text {text} written to json-l file.')

print(f'\n{len(text_list)} texts written to {output_filename}.')
Text a-midsummer-nights-dream_TXT_FolgerShakespeare.txt written to json-l file.
Text alls-well-that-ends-well_TXT_FolgerShakespeare.txt written to json-l file.
Text antony-and-cleopatra_TXT_FolgerShakespeare.txt written to json-l file.
Text as-you-like-it_TXT_FolgerShakespeare.txt written to json-l file.
Text coriolanus_TXT_FolgerShakespeare.txt written to json-l file.
Text cymbeline_TXT_FolgerShakespeare.txt written to json-l file.
Text hamlet_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-iv-part-1_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-iv-part-2_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-v_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-1_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-2_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-vi-part-3_TXT_FolgerShakespeare.txt written to json-l file.
Text henry-viii_TXT_FolgerShakespeare.txt written to json-l file.
Text julius-caesar_TXT_FolgerShakespeare.txt written to json-l file.
Text king-john_TXT_FolgerShakespeare.txt written to json-l file.
Text king-lear_TXT_FolgerShakespeare.txt written to json-l file.
Text loves-labors-lost_TXT_FolgerShakespeare.txt written to json-l file.
Text lucrece_TXT_FolgerShakespeare.txt written to json-l file.
Text macbeth_TXT_FolgerShakespeare.txt written to json-l file.
Text measure-for-measure_TXT_FolgerShakespeare.txt written to json-l file.
Text much-ado-about-nothing_TXT_FolgerShakespeare.txt written to json-l file.
Text othello_TXT_FolgerShakespeare.txt written to json-l file.
Text pericles_TXT_FolgerShakespeare.txt written to json-l file.
Text richard-ii_TXT_FolgerShakespeare.txt written to json-l file.
Text richard-iii_TXT_FolgerShakespeare.txt written to json-l file.
Text romeo-and-juliet_TXT_FolgerShakespeare.txt written to json-l file.
Text shakespeares-sonnets_TXT_FolgerShakespeare.txt written to json-l file.
Text the-comedy-of-errors_TXT_FolgerShakespeare.txt written to json-l file.
Text the-merchant-of-venice_TXT_FolgerShakespeare.txt written to json-l file.
Text the-merry-wives-of-windsor_TXT_FolgerShakespeare.txt written to json-l file.
Text the-phoenix-and-turtle_TXT_FolgerShakespeare.txt written to json-l file.
Text the-taming-of-the-shrew_TXT_FolgerShakespeare.txt written to json-l file.
Text the-tempest_TXT_FolgerShakespeare.txt written to json-l file.
Text the-two-gentlemen-of-verona_TXT_FolgerShakespeare.txt written to json-l file.
Text the-two-noble-kinsmen_TXT_FolgerShakespeare.txt written to json-l file.
Text the-winters-tale_TXT_FolgerShakespeare.txt written to json-l file.
Text timon-of-athens_TXT_FolgerShakespeare.txt written to json-l file.
Text titus-andronicus_TXT_FolgerShakespeare.txt written to json-l file.
Text troilus-and-cressida_TXT_FolgerShakespeare.txt written to json-l file.
Text twelfth-night_TXT_FolgerShakespeare.txt written to json-l file.
Text venus-and-adonis_TXT_FolgerShakespeare.txt written to json-l file.
42 texts written to my_data.jsonl.
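To verify the output, you can read the first record back from the JSON-L file; a minimal sketch:

# Read the first record back to confirm the JSON-L structure
with open(f'../data/{output_filename}', 'r') as infile:
    first_record = json.loads(infile.readline())
print(first_record['id'], first_record['wordCount'])
print(list(first_record.keys()))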
Gzip the JSON-L file#
# Gzip the dataset
with open(f'../data/{output_filename}', 'rb') as f_in:
    with gzip.open(f'../data/{output_filename}.gz', 'wb') as f_out:
        f_out.writelines(f_in)

print(f'Compression complete. \n{output_filename}.gz has been created.')
Compression complete.
my_data.jsonl.gz has been created.
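Because gzip preserves the line structure, records can also be streamed straight from the compressed file without unzipping it; a minimal sketch:

# Stream records directly from the gzipped dataset
with gzip.open(f'../data/{output_filename}.gz', 'rt') as infile:
    for line in infile:
        record = json.loads(line)
        print(record['id'], record['wordCount'])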
Note: The Constellate Lab saves Jupyter Notebooks but not dataset files. Be sure to save your dataset to your local machine or cloud storage.