Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
Tokenizing Text Files#
Description: You may have text files and metadata that you want to tokenize into ngrams with Python. This notebook takes as input:

- Plain text files (.txt) in a folder
- A metadata CSV file called `metadata.csv`

and outputs a single JSONL file containing the unigrams, bigrams, trigrams, full text, and metadata.
Knowledge Required: Python Basics (Start Python Basics I)
Import Libraries#
from collections import Counter
import gzip
import json
import os
import pandas as pd
Download and inspect sample files#
For the purposes of this tutorial, we will download a set of sample files from Project Gutenberg using a helper function from the `constellate` client.
from constellate import download_gutenberg_sample
text_file_directory = download_gutenberg_sample()
You now have sample text files and a CSV of metadata in your data directory.
You can list the contents of this directory with this command.
!ls -lt ~/data/gutenberg-sample
You can see the first 20 lines of a sample file with this command.
!head -n 20 ~/data/gutenberg-sample/205-0.txt
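If shell commands like `ls` and `head` aren't available on your system, you can do the same inspection in Python. A quick sketch, assuming the download step above has already run:

import os

# List the files in the sample directory.
print(os.listdir(text_file_directory))

# Show the first 20 lines of one of the sample files.
with open(os.path.join(text_file_directory, "205-0.txt")) as f:
    for _ in range(20):
        print(f.readline(), end="")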
Define a tokenizing function#
def constellate_ngrams(text, n=1):
    # Define a Counter object to hold our ngrams.
    c = Counter()
    # Replace line breaks with spaces so that words on adjacent
    # lines don't run together when the breaks are removed.
    t = text.replace("\r", " ").replace("\n", " ")
    # Convert the text to a list of words.
    words = t.split()
    # Slice the words into ngrams.
    for grams in zip(*[words[i:] for i in range(n)]):
        g = " ".join(grams)
        c[g] += 1
    return c
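To see how the `zip`-based slicing works, here is a quick check on a toy sentence; the expected output is shown in the comments:

sample = "the cat sat on the mat"
print(constellate_ngrams(sample, n=1).most_common(2))
# [('the', 2), ('cat', 1)]
print(constellate_ngrams(sample, n=2).most_common(2))
# [('the cat', 1), ('cat sat', 1)]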
Tokenize a text#
Let’s tokenize one of the sample files using our function.
# Read in one of the texts. See the note about file paths below.
with open(f"{text_file_directory}{os.sep}205-0.txt") as input_file:
    text = input_file.read()
unigrams = constellate_ngrams(text)
unigrams.most_common(10)
You can create bigrams or trigrams (or any n-grams) by changing the `n` keyword argument passed to the function.
bigrams = constellate_ngrams(text, n=2)
bigrams.most_common(10)
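For instance, setting `n=3` produces trigrams:

trigrams = constellate_ngrams(text, n=3)
trigrams.most_common(10)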
Creating a Constellate JSONL file#
For your analysis, you may want to create files that conform to the same data specification as the files provided by Constellate. The following steps show you how to load metadata and the raw text, create ngrams and output a JSONL (JSON lines) file that matches, in format, what you download from the Constellate web application.
df = pd.read_csv(text_file_directory + os.sep + "metadata.csv")
df.head()
Loop through the dataframe and print out some of the metadata.
for item in df.itertuples():
    print(item.title, item.author, item.url)
Now convert the metadata to the Constellate schema by mapping the column names from the source CSV to the corresponding Constellate schema attributes.
# Create a list to hold our documents.
documents = []

for item in df.itertuples():
    document = {
        "id": item.url,
        "title": item.title,
        "creator": [item.author],
        "docType": "book",
        "publicationYear": item.published,
        "url": item.url,
        "language": [item.language]
    }
    documents.append(document)
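As a quick sanity check, you can print a few fields from the first mapped document (the keys follow the schema built above):

first = documents[0]
print(first["id"], first["publicationYear"], first["language"])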
Now that we have our metadata stored in a list, let's revise the loop to also capture the full text of each document and generate ngrams.
# Create a list to hold our documents.
documents = []

for item in df.itertuples():
    document = {
        "id": item.url,
        "title": item.title,
        # A document can have multiple authors/creators, so map as a list.
        "creator": [item.author],
        "docType": "book",
        "publicationYear": item.published,
        "url": item.url,
        # A document can have multiple languages, so map as a list.
        "language": [item.language]
    }
    # Read in the full text. os.sep keeps the path portable across operating systems.
    with open(text_file_directory + os.sep + item.file) as text_file:
        text = text_file.read()
    # Split the text into pages. See the pagination note below.
    document["fullText"] = text.split("\n\n\n")
    # Generate ngrams
    document["unigramCount"] = constellate_ngrams(text, n=1)
    document["bigramCount"] = constellate_ngrams(text, n=2)
    document["trigramCount"] = constellate_ngrams(text, n=3)
    # Add our document to the list of documents
    documents.append(document)
    print(f"{item.title} processed")
Inspect the first document and print some of the metadata and content.
first_doc = documents[0]
print(first_doc["title"], first_doc["publicationYear"])
Print the twenty-five most common trigrams.
for term, count in first_doc["trigramCount"].most_common(25):
    print(term, count)
Print the first 500 characters of “page” 20.
print(first_doc["fullText"][20][:500])
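Because the pagination heuristic is rough (see the pagination note below), it is worth checking how many "pages" the split produced:

print(len(first_doc["fullText"]), "pages")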
Generate a Constellate gzip file#
You may now want to create a gzip file so that it matches what you have downloaded from Constellate. You can then use the `dataset_reader` that is part of the `constellate` client to read it back in.
output_file = text_file_directory + os.sep + "sample_gutenberg_dataset.json.gzip"
with gzip.open(output_file, "wb") as handle:
    for doc in documents:
        # Convert the document to a JSON string and add the line separator
        raw = json.dumps(doc) + "\n"
        handle.write(raw.encode())
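If you want to inspect the file without the `constellate` client, a minimal sketch using only the standard library reads the first line back and parses it:

# Open the gzip file in text mode and decode one JSON line.
with gzip.open(output_file, "rt") as handle:
    doc = json.loads(handle.readline())
    print(doc["title"])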
Now use the dataset reader to read your file back in and verify it is what we expect.
from constellate import dataset_reader
for doc in dataset_reader(output_file):
    print(doc["title"], doc["creator"], doc["publicationYear"])
    # See the note about assert below.
    assert doc["unigramCount"] is not None
    assert doc["fullText"] is not None
Notes#
- File paths - on Unix-based systems (including Linux and macOS), the parts of a file path are separated with a `/`. On Windows the separator is a `\`. Python includes the helpful `os.sep` to find the correct file separator for your system, which allows the notebook to run just fine on multiple operating systems (see the sketch after these notes for a related helper).
- Pagination - the plain text files from Project Gutenberg aren't paginated. Here we are using a simple rule of thumb: if there are three consecutive line breaks, treat this as a page break. This is unlikely to work well across all Project Gutenberg content but should be sufficient for demonstration purposes. You may be curious about more sophisticated attempts to format Project Gutenberg books, such as chapterize by Jonathan Reeve.
- `assert` - Python's `assert` statement can be a quick and useful way to validate your logic. By using `assert`, you are guaranteeing that the program won't continue running if the statement is false. In this usage, we are guaranteeing that each of our documents has a `fullText` and a `unigramCount` attribute.
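As an aside, Python's `os.path.join` builds a path with the correct separator automatically, which can be tidier than concatenating strings with `os.sep`. A minimal sketch using the `text_file_directory` from earlier in the notebook:

import os

# os.path.join inserts the right separator for the current operating system.
metadata_path = os.path.join(text_file_directory, "metadata.csv")
print(metadata_path)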