<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
___

# Sentiment Analysis

**Description:** This notebook describes Sentiment Analysis and demonstrates basic applications using:
* VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based algorithm
* Hugging Face's transformers library

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics I](../../PythonForDataAnalysis/GettingStarted/basic/python-basics-1.ipynb))
___

## Methods for Sentiment Analysis

Sentiment analysis can help an analyst discover whether feedback is positive, negative, or mixed. For example, a large company like Amazon or Walmart could use sentiment analysis on user reviews to determine whether a featured product should be promoted or discontinued. Approaches to sentiment analysis generally fall into two categories:

* Rule-based algorithms
* Machine Learning models 

### Rule-Based Algorithms

Rule-based algorithms assign sentiment scores to particular words or multi-word constructions. Simple algorithms may assess each word in a feedback document individually and add up an overall score. More complex algorithms may assess multi-word (or n-gram) constructions and apply special rules for issues such as negation, so they can detect the difference between "bad", "not bad", and "bad ass". Some algorithms also support emojis and emoticons, such as "=)" and "😁".
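
As a rough illustration of the simplest kind of rule-based scoring, the sketch below sums per-word scores and flips the sign of any word that directly follows a negation. The lexicon values, the `NEGATIONS` set, and the `score_text` helper are all hypothetical and chosen only for this example; they are not taken from any published lexicon.

```python
import re

# A toy rule-based scorer (illustrative only; this is not VADER).
# The lexicon values below are made up for this sketch.
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "awful": -3.0}
NEGATIONS = {"not", "never", "no"}

def score_text(text):
    """Sum word scores, flipping the sign of a word that follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for position, token in enumerate(tokens):
        value = TOY_LEXICON.get(token, 0.0)  # unknown words count as neutral
        if position > 0 and tokens[position - 1] in NEGATIONS:
            value = -value
        total += value
    return total

for sentence in ["This is bad.", "This is not bad.", "Not great, just awful!"]:
    print(score_text(sentence), sentence)
```

Even this tiny rule shows why word-by-word scoring alone is not enough: "bad" and "not bad" contain the same scored word but should not receive the same score.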

### Machine Learning Models

Machine learning models rely on feedback data that has already been assessed by humans to have a particular sentiment. Each piece of feedback is **labeled** by a human reader who places the feedback into a particular category. The categories could be as simple as positive, negative, or neutral. As long as **labeled** data exists, a machine learning model can often identify complex concepts. For example, a car manufacturer may want to classify feedback from past buyers as: "budget-conscious", "eco-conscious", "tech-enthusiastic", "luxury-driven", "performance-driven", etc. Assuming there is adequate labeled **training data** for each of these categories, a machine learning model could assign a score for each category. This could help analysts understand the brand better, answering questions about what consumers do or do not like about a particular vehicle.
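
To make the idea of learning from labeled data concrete, here is a minimal sketch of a word-count classifier (a naive Bayes style model with add-one smoothing) written in plain Python. The four training sentences, their labels, and the `train` and `classify` helpers are hypothetical; a real project would need far more labeled data and would normally use an established machine learning library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training data: each text was tagged by a human reader.
training_data = [
    ("great mileage and low price", "budget-conscious"),
    ("cheap to run and cheap to insure", "budget-conscious"),
    ("love the touchscreen and the app", "tech-enthusiastic"),
    ("the driver assist software is amazing", "tech-enthusiastic"),
]

def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest add-one-smoothed log probability."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_examples = sum(label_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label, count in label_counts.items():
        log_prob = math.log(count / total_examples)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            smoothed = (word_counts[label][word] + 1) / (words_in_label + len(vocabulary))
            log_prob += math.log(smoothed)
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label

word_counts, label_counts = train(training_data)
print(classify("cheap price", word_counts, label_counts))
print(classify("the app software", word_counts, label_counts))
```

Notice that no one wrote a rule saying "cheap" signals a budget-conscious buyer; the association comes entirely from the labels attached to the training examples.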

In the humanities, sentiment analysis could be used to track emerging trends on social media. For example, we might ask: "How are Twitter or Reddit users responding to a particular government policy or public event?" We could look at a hashtag like "#blm" and get a sense of national sentiment on the Black Lives Matter movement. The project [On the Books: Jim Crow and Algorithms of Resistance](https://onthebooks.lib.unc.edu/) is using machine classification to detect racist laws based on the pioneering work of [Pauli Murray](https://en.wikipedia.org/wiki/Pauli_Murray) and [Safiya Noble](https://en.wikipedia.org/wiki/Safiya_Noble)'s concept of "algorithmic oppression". 

## VADER

This notebook uses VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based algorithm that is "specifically attuned to sentiments expressed in social media." It relies on a specialized **lexicon** of words, phrases, and emojis. Each token in the lexicon is assigned a "mean-sentiment rating" between -4 (extremely negative) and 4 (extremely positive). Here are a few examples:

|Token|Mean-Sentiment Rating|
|---|---|
|(:|2.2|
|/:|-1.3|
|):<|-1.9|
|rotflmao|2.8|
|aghast|-1.9|
|awesome|3.1|
|awful|-2.0|

There are over 7500 tokens in the VADER lexicon. (You can also add your own if you like.) VADER also applies grammatical and syntactical rules to measure intensity, taking into account word-order sensitive relationships between terms. For example, it increases or decreases a sentiment score based on degree modifiers: compare "The product is good" with "The product is very good" and "The product is marginally good." To read more about VADER, including how it works, and to see its code, [visit the GitHub page](https://github.com/cjhutto/vaderSentiment).
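
The effect of degree modifiers can be sketched with a toy scorer. This is not VADER's actual implementation: the base score for "good", the modifier amounts, and the `score_with_modifiers` helper are hypothetical values and names chosen for illustration.

```python
# Toy illustration of degree modifiers (not VADER's actual rules).
# All numbers here are hypothetical values chosen for the example.
BASE_SCORES = {"good": 2.0}
MODIFIERS = {"very": 0.3, "marginally": -0.3}

def score_with_modifiers(text):
    """Score a sentence, adjusting a word's score if a modifier precedes it."""
    tokens = text.lower().rstrip(".").split()
    total = 0.0
    for position, token in enumerate(tokens):
        if token not in BASE_SCORES:
            continue
        value = BASE_SCORES[token]
        if position > 0 and tokens[position - 1] in MODIFIERS:
            adjustment = MODIFIERS[tokens[position - 1]]
            # Intensifiers push the score away from zero; dampeners pull it back
            value += adjustment if value > 0 else -adjustment
        total += value
    return total

for sentence in [
    "The product is marginally good.",
    "The product is good.",
    "The product is very good.",
]:
    print(round(score_with_modifiers(sentence), 2), sentence)
```

The three sentences share the same sentiment word, yet the modifier in front of it changes how strongly the sentence should be scored.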

## Applying the VADER Algorithm
First, we need to import the `SentimentIntensityAnalyzer` class. Here we create an analyzer object, which holds the VADER lexicon, and assign it to a variable `sa`.

In [1]:
# Import the SentimentIntensityAnalyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Create the variable sa to hold the VADER analyzer object
sa = SentimentIntensityAnalyzer()

We can preview the contents of the lexicon by using `sa.lexicon`. This will return a dictionary, where each key is a token and each value is a sentiment rating.

In [2]:
# Preview the lexicon contents
# There are over 7500 tokens in the lexicon
sa.lexicon

{'$:': -1.5,
 '%)': -0.4,
 '%-)': -1.5,
 '&-:': -0.4,
 '&:': -0.7,
 "( '}{' )": 1.6,
 '(%': -0.9,
 "('-:": 2.2,
 "(':": 2.3,
 '((-:': 2.1,
 '(*': 1.1,
 '(-%': -0.7,
 '(-*': 1.3,
 '(-:': 1.6,
 '(-:0': 2.8,
 '(-:<': -0.4,
 '(-:o': 1.5,
 '(-:O': 1.5,
 '(-:{': -0.1,
 '(-:|>*': 1.9,
 '(-;': 1.3,
 '(-;|': 2.1,
 '(8': 2.6,
 '(:': 2.2,
 '(:0': 2.4,
 '(:<': -0.2,
 '(:o': 2.5,
 '(:O': 2.5,
 '(;': 1.1,
 '(;<': 0.3,
 '(=': 2.2,
 '(?:': 2.1,
 '(^:': 1.5,
 '(^;': 1.5,
 '(^;0': 2.0,
 '(^;o': 1.9,
 '(o:': 1.6,
 ")':": -2.0,
 ")-':": -2.1,
 ')-:': -2.1,
 ')-:<': -2.2,
 ')-:{': -2.1,
 '):': -1.8,
 '):<': -1.9,
 '):{': -2.3,
 ');<': -2.6,
 '*)': 0.6,
 '*-)': 0.3,
 '*-:': 2.1,
 '*-;': 2.4,
 '*:': 1.9,
 '*<|:-)': 1.6,
 '*\\0/*': 2.3,
 '*^:': 1.6,
 ',-:': 1.2,
 "---'-;-{@": 2.3,
 '--<--<@': 2.2,
 '.-:': -1.2,
 '..###-:': -1.7,
 '..###:': -1.9,
 '/-:': -1.3,
 '/:': -1.3,
 '/:<': -1.4,
 '/=': -0.9,
 '/^:': -1.0,
 '/o:': -1.4,
 '0-8': 0.1,
 '0-|': -1.2,
 '0:)': 1.9,
 '0:-)': 1.4,
 '0:-3': 1.5,
 '0:03': 1.9,
 ...}

In [3]:
# Check if a word is in the lexicon
test_word = 'sweet' # The word to check for

# Get the word's score or print a message for missing words
sa.lexicon.get(test_word, 'No score for that word') 

2.0

In order to do our analysis, we will use a very small sample of 8 user reviews. Each review is a simple text string inside a list variable called `product_reviews`.

In [4]:
# Define a list of product reviews

product_reviews = [
    'I love this product. It helps me get so much work done. I tell everyone about what a great thing it is.',
    'This product is defective. I feel like it is broken because it does not do what it promises. Do not buy this.',
    'Do yourself a favor and buy this product as soon as possible. I recommend it to everyone I know. It has saved me so much time!',
    'This product is overpriced and useless. It was a waste of money and it made all my hair fall out.',
    'Works like a dream and it is a bargain! It solves my problems with ease. I bought two!',
    'Do not buy! This product is a ripoff. I wish it was better, but it fails constantly. What a mistake!',
    'This thing is garbage. Do yourself a favor and save the money. Mine is a dumpster fire and fell apart.',
    'I adore this product. =) It makes my life so much easier. And it is a deal!'
]

Now we will analyze each review and assign it a "normalized, weighted composite score" (the `compound` score). This score is computed by summing the valence scores of the words found in the lexicon (with some adjustments based on word order and other rules) and then normalizing the sum. VADER also measures the proportion of text that falls into positive, negative, and neutral sentiment. The compound score falls between -1 (the most negative) and +1 (the most positive). (This is different from the lexicon scores, which fall between -4 and +4!)
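
The step from a summed valence to the -1 to +1 range can be sketched as follows. To the best of our reading of the vaderSentiment source code, the library divides the summed score by the square root of its square plus a constant (`alpha`, 15 by default). The sketch is simplified: it ignores the adjustments VADER makes before this step, and the input totals are arbitrary example values.

```python
import math

def normalize(summed_valence, alpha=15):
    """Map an unbounded summed valence onto the range -1 to +1."""
    return summed_valence / math.sqrt(summed_valence ** 2 + alpha)

# Arbitrary example totals, from strongly negative to strongly positive
for total in [-8.0, -1.4, 0.0, 1.7, 8.0]:
    print(total, round(normalize(total), 4))
```

Because of the square root in the denominator, the result approaches -1 or +1 as more sentiment-bearing words accumulate but never passes those limits.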

In [5]:
# For each review in our `product_reviews` list
# Store a polarity score in `scores`
# Then print the score followed by the review
for review in product_reviews:
    scores = sa.polarity_scores(review)
    print(scores['compound'], review)

0.8979 I love this product. It helps me get so much work done. I tell everyone about what a great thing it is.
-0.2263 This product is defective. I feel like it is broken because it does not do what it promises. Do not buy this.
0.807 Do yourself a favor and buy this product as soon as possible. I recommend it to everyone I know. It has saved me so much time!
-0.6808 This product is overpriced and useless. It was a waste of money and it made all my hair fall out.
0.7772 Works like a dream and it is a bargain! It solves my problems with ease. I bought two!
-0.6792 Do not buy! This product is a ripoff. I wish it was better, but it fails constantly. What a mistake!
0.4404 This thing is garbage. Do yourself a favor and save the money. Mine is a dumpster fire and fell apart.
0.8799 I adore this product. =) It makes my life so much easier. And it is a deal!


Our simple analysis does a fairly good job of assessing positive and negative sentiment. Notice, though, that the score for our second-to-last review is not very accurate:
> 0.4404 This thing is garbage. Do yourself a favor and save the money. Mine is a dumpster fire and fell apart.

The VADER lexicon contains the following entries:

|Token|Mean-Sentiment Rating|
|---|---|
|favor|1.7|
|fire|-1.4|

VADER assigns a value of -1.4 to "fire," but "fire" can also have a positive connotation, as in "straight fire." Words like "garbage" and "dumpster," as in "dumpster fire," are less ambiguous, yet they are missing from the lexicon. If a specific token is not found in the VADER lexicon, it is considered to be neutral. Like most approaches to text analysis, the process benefits from having more data. In this case, the reviews are very short and several significant words do not happen to exist in our lexicon.
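
One way to spot this kind of gap is to list which words in a review have a lexicon entry and which do not. The sketch below uses a hypothetical helper, `split_by_coverage`, and a stand-in lexicon containing only the two entries from the table above; in the notebook you could pass `sa.lexicon` instead. The simple regular-expression tokenizer is an assumption and does not match VADER's own tokenization.

```python
import re

def split_by_coverage(text, lexicon):
    """Return (scored, unscored) word lists for one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    scored = [(word, lexicon[word]) for word in words if word in lexicon]
    unscored = [word for word in words if word not in lexicon]
    return scored, unscored

# Stand-in lexicon holding only the two entries shown in the table above
sample_lexicon = {"favor": 1.7, "fire": -1.4}

review = ("This thing is garbage. Do yourself a favor and save the money. "
          "Mine is a dumpster fire and fell apart.")

scored, unscored = split_by_coverage(review, sample_lexicon)
print("Scored:", scored)
print("Unscored:", unscored)
```

Reading through the unscored list is a quick way to find candidate words to add to the lexicon.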

## Adding Tokens to the VADER Lexicon

`sa.lexicon` is a plain Python dictionary, so we can add words that we want included. There are some guidelines for best scoring practices in the academic paper linked on [VADER's GitHub repository](https://github.com/cjhutto/vaderSentiment). (Remember that lexicon tokens are scored from -4 to +4.)

In [6]:
# Adding the dictionary of `new_words`
# to sa.lexicon

new_words = {
    'garbage': -2.0,
    'dumpster': -3.1,
}

sa.lexicon.update(new_words)

Let's try our analysis again with the new lexicon.

In [7]:
# For each review in our `product_reviews` list
# Store a polarity score in `scores`
# Then print the score followed by the review

for review in product_reviews:
    scores = sa.polarity_scores(review)
    print(scores['compound'], review)

0.8979 I love this product. It helps me get so much work done. I tell everyone about what a great thing it is.
-0.2263 This product is defective. I feel like it is broken because it does not do what it promises. Do not buy this.
0.807 Do yourself a favor and buy this product as soon as possible. I recommend it to everyone I know. It has saved me so much time!
-0.6808 This product is overpriced and useless. It was a waste of money and it made all my hair fall out.
0.7772 Works like a dream and it is a bargain! It solves my problems with ease. I bought two!
-0.6792 Do not buy! This product is a ripoff. I wish it was better, but it fails constantly. What a mistake!
-0.5574 This thing is garbage. Do yourself a favor and save the money. Mine is a dumpster fire and fell apart.
0.8799 I adore this product. =) It makes my life so much easier. And it is a deal!


## Sentiment Analysis with Machine Learning

The primary advantage of using a machine learning classifier for sentiment analysis is that there is no need to maintain a lexicon, assign sentiment scores to particular words, develop linguistic rules based on grammatical structures (negation, intensifiers), or keep track of novel expressions (slang, emoticons, etc.).

The very best models for sentiment analysis will be trained or tuned on the type of data you are analyzing. We always recommend trying existing models first though, since training a model from scratch takes significant resources, both on the computational side and on the labor side for data quality assurance. If you are interested in training your own models, then you may want to invest in high-end hardware, especially a computer with a powerful graphics processing unit (GPU). If your data or model requires significant resources, consider purchasing cloud-computing resources.

There are many existing models that are a great place to start with sentiment analysis. Let's try using an existing model with the popular [transformers](https://github.com/huggingface/transformers) library from [Hugging Face](https://huggingface.co/). We will use a [dataset of Amazon game reviews](https://huggingface.co/datasets/LoganKells/amazon_product_reviews_video_games/) created by Logan Kells.

In [None]:
# Install the transformers package
%pip install -q transformers

In [None]:
# Install TensorFlow, a popular library for machine learning
%pip install tensorflow

In [None]:
# Install the tf_keras package
%pip install tf_keras

In [10]:
# Download the dataset
from pathlib import Path
import urllib.request

# The file URL
url = 'https://huggingface.co/datasets/LoganKells/amazon_product_reviews_video_games/resolve/main/data.csv'

# Check if a data folder exists. If not, create it.
data_folder = Path('../data/')
data_folder.mkdir(exist_ok=True)

# Download the file
path_url = Path(url)
urllib.request.urlretrieve(url, f'{data_folder.as_posix()}/{path_url.name}')
    
# Success message
print('Data downloaded.')

Data downloaded.


In [11]:
# Import Pandas and Read Data CSV file
import pandas as pd
df = pd.read_csv('../data/data.csv')

In [12]:
# Check the size of the dataframe
df.shape

(50000, 10)

In [13]:
# Preview the dataframe
df.head()

| Unnamed: 0.1 | Unnamed: 0 | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A2HD75EMZR8QLN | 700099867 | 123 | [8, 12] | Installing the game was a struggle (because of... | 0.0 | Pay to unlock content? I don't think so. | 1341792000 | 07 9, 2012 |
| 1 | 1 | A3UR8NLLY1ZHCX | 700099867 | Alejandro Henao "Electronic Junky" | [0, 0] | If you like rally cars get this game you will ... | 3.0 | Good rally game | 1372550400 | 06 30, 2013 |
| 2 | 2 | A1INA0F5CWW3J4 | 700099867 | Amazon Shopper "Mr.Repsol" | [0, 0] | 1st shipment received a book instead of the ga... | 0.0 | Wrong key | 1403913600 | 06 28, 2014 |
| 3 | 4 | A361M14PU2GUEG | 700099867 | Angry Ryan "Ryan A. Forrest" | [2, 2] | I had Dirt 2 on Xbox 360 and it was an okay ga... | 3.0 | DIRT 3 | 1308009600 | 06 14, 2011 |
| 4 | 5 | A2UTRVO4FDCBH6 | 700099867 | A.R.G. | [0, 0] | Overall this is a well done racing game, with ... | 3.0 | Good racing game, terrible Windows Live Requir... | 1368230400 | 05 11, 2013 |


In [14]:
# Create a list of the first 100 review texts
review_texts = df['reviewText'].tolist()[:100]

In [22]:
# Use the pipeline helper to make predictions with a model from the Hugging Face Hub
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

sentiment_scores = sentiment_pipeline(review_texts)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


In [19]:
# Examine each review to see how the model did
review = 0

# Print Sentiment
print(sentiment_scores[review])

# Print Review
print(review_texts[review])

{'label': 'NEGATIVE', 'score': 0.9992536902427673}
Installing the game was a struggle (because of games for windows live bugs).Some championship races and cars can only be "unlocked" by buying them as an addon to the game. I paid nearly 30 dollars when the game was new. I don't like the idea that I have to keep paying to keep playing.I noticed no improvement in the physics or graphics compared to Dirt 2.I tossed it in the garbage and vowed never to buy another codemasters game. I'm really tired of arcade style rally/racing games anyway.I'll continue to get my fix from Richard Burns Rally, and you should to. :)http://www.amazon.com/Richard-Burns-Rally-PC/dp/B000C97156/ref=sr_1_1?ie=UTF8&qid;=1341886844&sr;=8-1&keywords;=richard+burns+rallyThank you for reading my review! If you enjoyed it, be sure to rate it as helpful.
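
Each result from the pipeline is a dictionary with a `label` and a `score`. A quick way to summarize many results is to count the labels and convert each result into a signed number. The sketch below works on a small sample list rather than the full `sentiment_scores`: the first two entries are copied from outputs shown in this notebook, and the third is hypothetical. The `signed_score` helper is our own illustration, and its output is a model confidence with a sign attached, not an equivalent of VADER's compound score.

```python
from collections import Counter

# Sample results shaped like the pipeline output.
sample_scores = [
    {'label': 'NEGATIVE', 'score': 0.9992536902427673},
    {'label': 'NEGATIVE', 'score': 0.8348784446716309},
    {'label': 'POSITIVE', 'score': 0.97},  # hypothetical entry
]

def signed_score(result):
    """Return +score for POSITIVE labels and -score for NEGATIVE labels."""
    return result['score'] if result['label'] == 'POSITIVE' else -result['score']

# Count how many reviews received each label
label_counts = Counter(result['label'] for result in sample_scores)

# Attach a sign to each confidence score and round for readability
signed = [round(signed_score(result), 4) for result in sample_scores]

print(label_counts)
print(signed)
```

In the notebook, replacing `sample_scores` with `sentiment_scores` would summarize all 100 reviews at once.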


In [20]:
# Show reviews with lower certainty scores
for review_number, sentiment_score in enumerate(sentiment_scores):
    if sentiment_score['score'] < .9:
        print(sentiment_score, review_texts[review_number], '\n')

{'label': 'NEGATIVE', 'score': 0.8348784446716309} I have been playing car racing games since their early beginning on PC. I currently have a logitech G25 force feedback wheel to play with.I have played most of the Colin Mc Rae, Need for speed, Grid and Dirt2 games before. I also tried my hands at more simulation oriented games like GTR, GTR2, GP Legends...Dirt2 came in as a nice and pleasant surprise. Dirt3 tops it off.As I am sure many will detail everything about this game, I will limit myself to the most important points for me.Pros:- Amazing graphics- Amazing physics- Challenging but entertaining racesCons:- Gymkhana- InterfaceWhat went through the conceptors mind about gymkhana? Racing is not easy for most but gymkhana is really a pain in the neck.I saw the videos of Ken Block on Youtube and indeed he is really impressive. Am I even dreaming of doing the same? No!Racing require good racing skills to start but Gymkhana requires a perfect control of the vehicles and most of us will