
Created by Nathan Kelber and Ted Lawless for JSTOR Labs under Creative Commons CC BY License
For questions/comments/improvements, email nathan.kelber@ithaka.org.


Exploring Metadata and Pre-Processing#

Description of methods in this notebook: This notebook shows how to explore and pre-process the metadata of a dataset using Pandas.

The following processes are described:

  • Importing a CSV file containing the metadata for a given dataset ID

  • Creating a Pandas dataframe to view the metadata

  • Pre-processing your dataset by filtering out unwanted texts

  • Importing the full JSON Lines (.jsonl) dataset and filtering it

  • Visualizing the metadata of your pre-processed dataset by the number of documents/year and pages/year

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion time: 60 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: CSV, JSON Lines (.jsonl)

Libraries Used: constellate, pandas, json, gzip, pathlib

Research Pipeline: None


### Import modules and libraries ###

import constellate
import pandas as pd
import os
import json
from pathlib import Path
import gzip

# For displaying plots inline with Jupyter Notebooks
%matplotlib inline

Import your dataset#

We’ll use the constellate client to automatically retrieve the metadata for a dataset. We can retrieve metadata in a CSV file using the get_metadata method.

Enter a dataset ID in the next code cell.

If you don’t have a dataset ID, you can use the default sample dataset (Shakespeare Quarterly) defined in the next code cell, or build your own dataset with the Constellate builder.

# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"

Next, pass the dataset_id as an argument to the constellate client’s get_metadata method.

# Pull in the metadata CSV for the sampled (1500 items) dataset
# The .get_metadata() method downloads the CSV file for our metadata
# to the /data folder and returns a string for the file name and location
# dataset_metadata will be a string containing that file name and location
dataset_metadata = constellate.get_metadata(dataset_id)

# To download the metadata for the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
#dataset_metadata = constellate.download(dataset_id, 'metadata')

We are now ready to create a dataframe for our analysis. We will use the read_csv() method to create our dataframe from the CSV file.

# Create our dataframe using Pandas
df = pd.read_csv(dataset_metadata)
df

We can confirm the size of our dataset using the len() function on our dataframe.

original_document_count = len(df)
print(f'Total original documents: {original_document_count}')

Now let’s take a look at the data in our dataframe df. We will set pandas to show all columns using set_option() then get a preview using head().

# Set the pandas option to show all columns
# Setting None gives us all columns
# To show fewer columns, replace None with an integer
pd.set_option('display.max_columns', None) 

# Set the maximum number of rows to 50
pd.set_option('display.max_rows', 50)

# Show the first five rows of our dataframe
# To show a different number of preview rows
# Pass an integer into the .head()
df.head() 

Here are descriptions for the metadata types found in each column:

| Column Name | Description |
|---|---|
| id | a unique item ID (in JSTOR, this is a stable URL) |
| title | the title for the item |
| isPartOf | the larger work that holds this title (for example, a journal title) |
| publicationYear | the year of publication |
| doi | the digital object identifier for an item |
| docType | the type of document (for example, article or book) |
| provider | the source or provider of the dataset |
| datePublished | the publication date in yyyy-mm-dd format |
| issueNumber | the issue number for a journal publication |
| volumeNumber | the volume number for a journal publication |
| url | a URL for the item and/or the item’s metadata |
| creator | the author or authors of the item |
| publisher | the publisher for the item |
| language | the language or languages of the item (eng is the ISO 639 code for English) |
| pageStart | the first page number of the print version |
| pageEnd | the last page number of the print version |
| placeOfPublication | the city of the publisher |
| wordCount | the number of words in the item |
| pageCount | the number of print pages in the item |
| outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
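If you want to confirm how pandas typed these columns, or preview just a few of them, the following is a minimal sketch. It assumes the column names above appear in your metadata CSV; availability can vary by dataset.

# Show the data type pandas assigned to each metadata column
print(df.dtypes)

# Preview a few columns of interest
# These column names are taken from the table above; adjust as needed
df[['id', 'title', 'publicationYear', 'creator', 'wordCount']].head()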


Filtering out columns using Pandas#

If there are any columns you would like to drop from your analysis, you can drop them with:

df = df.drop(['column_name1', 'column_name2', ...], axis=1)

# Drop each of these named columns
# axis=1 specifies we are dropping columns
# axis=0 would specify to drop rows
df = df.drop(['outputFormat', 'pageEnd', 'pageStart', 'datePublished'], axis=1)

# Show the first five rows of our updated dataframe
df.head()

Filtering out rows with Pandas#

Now that we have filtered out unwanted metadata columns, we can begin filtering out any texts that may not match our research interests. Let’s examine the first and last ten rows of the dataframe to see if we can identify texts that we would like to remove. We are looking for patterns in the metadata that could help us remove many texts at once.

# Preview the first ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.head(10)
# Preview the last ten items in the dataframe
# Can you identify patterns to select rows to remove?
df.tail(10)
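One quick way to surface such patterns is to count how often values repeat in a column. This is a minimal sketch, assuming the title and docType columns are present in your metadata:

# Count the most common titles; repeated generic titles
# (for example, 'Review Article' or 'Front Matter') often mark paratext
print(df['title'].value_counts().head(20))

# Count how many items there are of each document type
print(df['docType'].value_counts())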
# We create a function to report how many documents were removed.

def texts_report(pre_count):
    """Prints out a report of:
    1. How many documents were removed
    2. The total original number of documents
    3. The total current number of documents
    """
    
    removed_count = pre_count - len(df)
    print(f'{removed_count} texts were removed.')
    print(f'Total original documents: {original_document_count}')
    print(f'Total current documents: {len(df)}')

Remove all rows without data for a particular column#

For example, we may wish to remove any texts that do not have authors. (In the case of journals, this may be helpful for removing paratextual sections such as the table of contents, indices, etc.) The column of interest in this case is creator.

# Remove all texts without an author

print('Removing texts without authors...')
initial_count = len(df)
df = df.dropna(subset=['creator']) # Drop each row that has no value under 'creator'

# Report the number of texts removed
texts_report(initial_count)

Remove rows based on the content of a particular column#

We can also remove texts, depending on whether we do (or do not) want a particular value in a column. Here are a few examples.

# Remove all items with a particular title
# Change `title` to your desired column
# Change 'Review Article' to the title you want to remove
title_to_remove = 'Review Article'

# Removing texts
print(f'Removing texts with title "{title_to_remove}"...')
initial_count = len(df)

# Create a filter that returns all titles that do not match `title_to_remove`
# Apply the filter to the DataFrame
title_filter = df['title'] != title_to_remove
df = df[title_filter]

# Report the number of texts removed
texts_report(initial_count)
# Keep only items in a particular language
# Change `language` to your desired column
# Change 'eng' to your desired language code
language = 'eng'

# Removing texts
print(f'Removing texts not in "{language}" language...')
initial_count = len(df)

# Create a filter that returns all languages matching `language` variable
language_filter = df['language'] == language
df = df[language_filter] # Apply filter to the DataFrame

# Report the number of texts removed
texts_report(initial_count)
# Remove all items with fewer than 1500 words
# Change `min_word_count` to your desired minimum number of words
min_word_count = 1500

# Removing texts
print(f'Removing texts with fewer than {min_word_count} words...')
initial_count = len(df)

# Create a filter that returns all items with at least `min_word_count` words
# Apply the filter to the DataFrame
word_count_filter = df['wordCount'] >= min_word_count
df = df[word_count_filter]

# Report the number of texts removed
texts_report(initial_count)

Take a final look at your dataframe to make sure the current texts fit your research goals. In the next step, we will save the IDs of your pre-processed dataset.

# Preview the first 50 rows of your dataset
# If all the items look good, move to the next step.
df.head(50)

Filtering the Full JSON-L Dataset#

We filtered the metadata CSV down to the texts we want to keep. Next, we will create a list of their IDs, download the full dataset (JSON-L file), and filter out any unwanted documents.

# Create a Python list of IDs we want from the Pandas dataframe
filtered_id_list = df["id"].tolist()

We can now download the sampled (1500 items) dataset (or the full dataset if it has been requested). See the Constellate Client documentation for more: https://constellate.org/docs/constellate-client

# Pull in the sampled (1500 items) dataset in JSON Lines format
# The .get_dataset() method downloads the sampled JSON file
dataset_file = constellate.get_dataset(dataset_id)

# Download the full dataset (JSON-L file)
# The full dataset must be requested first
#dataset_file = constellate.download(dataset_id, 'jsonl')

Next, we will create an empty JSON-L file to filter our results into. If the file already exists, we will ask whether to overwrite it.

### Create the filtered JSON-L file ###
# This may take several minutes to complete


# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Define the file output name
file_path = Path.cwd() / '..' / 'data' / 'my_data.jsonl' # You may change the name of the file here

# If the output file already exists, ask before overwriting it
if file_path.exists():
    overwrite = input(f'Overwrite {file_path}? (yes/no)')
    if overwrite.lower() == 'yes':
        print(f'Overwriting older version of {file_path}')
        file_path.unlink()
        file_path.touch()

We can now filter through each document in the original JSON file. All documents that match our filtered ID list will be added to the new JSON file.

# Append all documents with ids in `filtered_id_list`
# A set makes the membership check faster than a list
filtered_id_set = set(filtered_id_list)

with file_path.open('a') as outfile:
    for document in constellate.dataset_reader(dataset_file):
        # Append any documents in the filtered list
        if document['id'] in filtered_id_set:
            json.dump(document, outfile)
            outfile.write('\n')
print(f'{file_path} created.')
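As a quick sanity check, you can confirm that the number of documents written matches the number of IDs you kept. Here is a minimal sketch using the file and list created above (counts may differ if your metadata CSV and dataset file cover different samples):

# Count the documents written to the filtered JSON-L file
with file_path.open('r') as infile:
    written_count = sum(1 for line in infile)
print(f'{written_count} documents written; {len(filtered_id_list)} IDs expected.')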

Finally, we compress the dataset file using gzip. This will make it easier to transfer the file.

Note!

The Constellate lab has a storage limit of 10 GB. We recommend keeping datasets locally or on another form of cloud storage. Storing large datasets in the lab may slow down your lab startup time.

# Compress the file using gzip
# This may take several minutes to complete
# Download this file if you want to keep it.
# The Constellate Lab only saves notebooks, not JSON-L datasets
with file_path.open('rb') as f_in, gzip.open(f'{file_path}.gz', 'wb') as f_out:
    f_out.writelines(f_in)
print(f'Compression complete. \n{file_path}.gz has been created. Please download to your computer to save.')
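If you want to verify the compressed file later, it can be read back with the standard library alone. Here is a minimal sketch that loads just the first document from the .jsonl.gz file created above:

# Read the first document back out of the compressed JSON-L file
with gzip.open(f'{file_path}.gz', 'rt') as infile:
    first_document = json.loads(infile.readline())
print(first_document['id'])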

Visualizing the Pre-Processed Data#

# Group the data by publication year and plot the count of ids as a bar chart
df.groupby(['publicationYear'])['id'].agg('count').plot.bar(title='Documents by year', figsize=(20, 5), fontsize=12); 

# Read more about Pandas dataframe plotting here: 
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html

And now let’s look at the total page counts by year.

# Group the data by publication year and plot the sum of the page counts as a bar chart

df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(title='Pages by year', figsize=(20, 5), fontsize=12);
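If you want to keep a copy of either chart, the plot call returns a matplotlib Axes object whose figure can be saved to disk. A minimal sketch follows; the file name pages_by_year.png is just an example.

# Save the pages-by-year chart as a PNG file
ax = df.groupby(['publicationYear'])['pageCount'].agg('sum').plot.bar(
    title='Pages by year', figsize=(20, 5), fontsize=12)
ax.figure.savefig('pages_by_year.png', bbox_inches='tight')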