
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Exploring Metadata and Pre-Processing#

Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.

The following processes are described:

  • Importing a CSV file containing the metadata for a given dataset ID

  • Creating a Pandas dataframe to view the metadata

  • Pre-processing your dataset by filtering out unwanted texts

  • Importing the full JSON Lines (.jsonl) dataset and filtering it

  • Visualizing the metadata of your pre-processed dataset by the number of documents/year and pages/year

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion time: 60 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV, JSON Lines (.jsonl)

Libraries Used: constellate, pandas, json, gzip, pathlib

Research Pipeline: None


### Import modules and libraries ###

import constellate
import pandas as pd
import os
import json
from pathlib import Path
import gzip

# For displaying plots inline with Jupyter Notebooks
%matplotlib inline

Import your dataset#

We’ll use the constellate client to automatically retrieve the metadata for a dataset. We can retrieve metadata in a CSV file using the get_metadata method.

Enter a dataset ID in the next code cell.

If you don’t have a dataset ID, you can use the default sample dataset (Shakespeare Quarterly) defined in the next code cell, or build your own dataset with the Constellate builder.

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, pass the dataset_id as an argument to the constellate client’s get_metadata method.

# Pull in the metadata CSV for the sampled (1500 items) dataset
# The .get_metadata() method downloads the CSV file for our metadata
# to the /data folder and returns a string for the file name and location
# dataset_metadata will be a string containing that file name and location
dataset_metadata = constellate.get_metadata(dataset_id)

# To download the metadata for the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_metadata = constellate.download(dataset_id, 'metadata')

We are now ready to create a dataframe for our analysis. We will use the read_csv() method to create our dataframe from the CSV file.

# Create our dataframe using Pandas
df = pd.read_csv(dataset_metadata)
df

We can confirm the size of our dataset using the len() function on our dataframe.

original_document_count = len(df)
print(f'Total original documents: {original_document_count}')

Now let’s take a look at the data in our dataframe df. We will set pandas to show all columns using set_option() then get a preview using head().

# Set the pandas option to show all columns
# Setting None gives us all columns
# To show fewer columns, replace None with an integer
pd.set_option('display.max_columns', None) 

# Set the maximum number of rows to 50
pd.set_option('display.max_rows', 50)

# Show the first five rows of our dataframe
# To show a different number of preview rows
# Pass an integer into the .head()
df.head() 

Here are descriptions for the metadata types found in each column:

| Column Name | Description |
|---|---|
| id | a unique item ID (in JSTOR, this is a stable URL) |
| title | the title for the item |
| isPartOf | the larger work that holds this title (for example, a journal title) |
| publicationYear | the year of publication |
| doi | the digital object identifier for an item |
| docType | the type of document (for example, article or book) |
| provider | the source or provider of the dataset |
| datePublished | the publication date in yyyy-mm-dd format |
| issueNumber | the issue number for a journal publication |
| volumeNumber | the volume number for a journal publication |
| url | a URL for the item and/or the item’s metadata |
| creator | the author or authors of the item |
| publisher | the publisher for the item |
| language | the language or languages of the item (eng is the ISO 639 code for English) |
| pageStart | the first page number of the print version |
| pageEnd | the last page number of the print version |
| placeOfPublication | the city of the publisher |
| wordCount | the number of words in the item |
| pageCount | the number of print pages in the item |
| outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
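If you want to confirm how pandas typed these columns, or preview just a few of them, the following is a minimal sketch. It assumes the column names above appear in your metadata CSV; availability can vary by dataset.

# Show the data type pandas assigned to each metadata column
print(df.dtypes)

# Preview a few columns of interest
# These column names are taken from the table above; adjust as needed
df[['id', 'title', 'publicationYear', 'creator', 'wordCount']].head()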


Filtering out columns using Pandas#

If there are any columns you would like to drop from your analysis, you can drop them with:

df = df.drop(['column_name1', 'column_name2', ...], axis=1)

# Drop each of these named columns
# axis=1 specifies we are dropping columns
# axis=0 would specify to drop rows
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'], axis=1)

# Show the first five rows of our updated dataframe
df.head()

Filtering out rows with Pandas#

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let’s examine the first and last ten rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

# Preview the first ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.head(10)
# Preview the last ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.tail(10)
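One quick way to surface such patterns is to count how often values repeat in a column. This is a minimal sketch, assuming the title and docType columns are present in your metadata:

# Count the most common titles; repeated generic titles
# (for example, 'Review Article' or 'Front Matter') often mark paratext
print(df['title'].value_counts().head(20))

# Count how many items there are of each document type
print(df['docType'].value_counts())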
# We create a function to report how many documents were removed.

def texts_report(pre_count):
    """Prints out a report of:
    1. How many documents were removed
    2. The total original number of documents
    3. The total current number of documents
    """
    
    removed_count = pre_count - len(df)
    print(f'{removed_count} texts were removed.')
    print(f'Total original documents: {original_document_count}')
    print(f'Total current documents: {len(df)}')

Remove all rows without data for a particular column#

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is creator.

# Remove all texts without an author

print('Removing texts without authors...')
initial_count = len(df)
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'

# Report the number of texts removed
texts_report(initial_count)

Remove rows based on the content of a particular column#

We can also remove texts, depending on whether we do (or do not) want a particular value in a column. Here are a few examples.

# Remove all items with a particular title
# Change `title` to your desired column
# Change 'Review Article' to the title you want to remove
title_to_remove = 'Review Article'

# Removing texts
print(f'Removing texts with title "{title_to_remove}"...')
initial_count = len(df)

# Create a filter that returns all titles that do not match `title_to_remove`
# Apply the filter to the DataFrame
title_filter = df['title'] != title_to_remove
df = df[title_filter]

# Report the number of texts removed
texts_report(initial_count)
# Keep only items in a particular language
# Change `language` to your desired column
# Change 'eng' to your desired language code
language = 'eng'

# Removing texts
print(f'Removing texts not in "{language}" language...')
initial_count = len(df)

# Create a filter that returns all languages matching `language` variable
language_filter = df['language'] == language
df = df[language_filter] # Apply filter to the DataFrame

# Report the number of texts removed
texts_report(initial_count)
# Remove all items with fewer than 1500 words
# Change `min_word_count` to your desired minimum number of words
min_word_count = 1500

# Removing texts
print(f'Removing texts with fewer than {min_word_count} words...')
initial_count = len(df)

# Create a filter that returns all items with at least `min_word_count` words
# Apply the filter to the DataFrame
word_count_filter = df['wordCount'] >= min_word_count
df = df[word_count_filter]

# Report the number of texts removed
texts_report(initial_count)

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

# Preview the first 50 rows of your dataset
# If all the items look good, move to the next step.
df.head(50)

Filtering the Full JSON-L Dataset#

We filtered the metadata CSV down to the texts we want to keep. Next, we will create a list of their IDs, download the full dataset (JSON-L file), and filter out any unwanted documents.

# Create a Python list of IDs we want from the Pandas dataframe
filtered_id_list = df["id"].tolist()

We can now download the sampled (1500 items) dataset (or the full dataset if it has been requested). See the Constellate Client documentation for more: https://constellate.org/docs/constellate-client

# Pull in the sampled (1500 items) dataset in JSON Lines format
# The .get_dataset() method downloads the sampled JSON file
dataset_file = constellate.get_dataset(dataset_id)

# Download the full dataset (JSON-L file)
# The full dataset must be requested first
#dataset_file = constellate.download(dataset_id, 'jsonl')

Next, we will create an empty JSON-L file to filter our results into. If the file already exists, we will ask whether to overwrite it.

### Create the filtered JSON-L file ###
# This may take several minutes to complete


# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Define the file output name
file_path = Path.cwd() / '..' / 'data' / 'my_data.jsonl' # You may change the name of the file here

# If the output file already exists, ask before overwriting it
if file_path.exists():
    overwrite = input(f'Overwrite {file_path}? (yes/no)')
    if overwrite.lower() == 'yes':
        print(f'Overwriting older version of {file_path}')
        file_path.unlink()
        file_path.touch()

We can now filter through each document in the original JSON file. All documents that match our filtered ID list will be added to the new JSON file.

# Append all documents with ids in `filtered_id_list`
# A set makes the membership check faster than a list
filtered_id_set = set(filtered_id_list)

with file_path.open('a') as outfile:
    for document in constellate.dataset_reader(dataset_file):
        # Append any documents in the filtered list
        if document['id'] in filtered_id_set:
            json.dump(document, outfile)
            outfile.write('\n')
print(f'{file_path} created.')
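As a quick sanity check, you can confirm that the number of documents written matches the number of IDs you kept. Here is a minimal sketch using the file and list created above (counts may differ if your metadata CSV and dataset file cover different samples):

# Count the documents written to the filtered JSON-L file
with file_path.open('r') as infile:
    written_count = sum(1 for line in infile)
print(f'{written_count} documents written; {len(filtered_id_list)} IDs expected.')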

Finally, we compress the dataset file using gzip. This will make it easier to transfer the file.

Note!

The Constellate lab has a storage limit of 10 GB. We recommend keeping datasets locally or on another form of cloud storage. Storing large datasets in the lab may slow down your lab startup time.

# Compress the file using gzip
# This may take several minutes to complete
# Download this file if you want to keep it.
# The Constellate Lab only saves notebooks, not JSON-L datasets
with file_path.open('rb') as f_in, gzip.open(f'{file_path}.gz', 'wb') as f_out:
    f_out.writelines(f_in)
print(f'Compression complete. \n{file_path}.gz has been created. Please download to your computer to save.')
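If you want to verify the compressed file later, it can be read back with the standard library alone. Here is a minimal sketch that loads just the first document from the .jsonl.gz file created above:

# Read the first document back out of the compressed JSON-L file
with gzip.open(f'{file_path}.gz', 'rt') as infile:
    first_document = json.loads(infile.readline())
print(first_document['id'])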

Visualizing the Pre-Processed Data#

# Group the data by publication year and plot the count of ids as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

And now let’s look at the total page counts by year.

# Group the data by publication year and plot the sum of the page counts as a bar chart

df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
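If you want to keep a copy of either chart, the plot call returns a matplotlib Axes object whose figure can be saved to disk. A minimal sketch follows; the file name pages_by_year.png is just an example.

# Save the pages-by-year chart as a PNG file
ax = df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(
    title='Pages by year', figsize=(20, 5), fontsize=12)
ax.figure.savefig('pages_by_year.png', bbox_inches='tight')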