Advanced Text Analysis
These notebooks cover foundational techniques in text analysis. One notebook introduces sentiment analysis, using VADER and Hugging Face’s transformers library to classify the sentiment of a text. Another explains tokenization, covering word segmentation, n-grams, stemming, and lemmatization. Two notebooks demonstrate how to convert plain text files and their metadata into JSON Lines (JSON-L) records containing unigrams, bigrams, trigrams, and the full text. The final notebook focuses on topic modeling, including dataset retrieval, token cleaning, and topic visualization with Gensim and pyLDAvis.
Sentiment Analysis
Tokenizers
Tokenize Text Files
Tokenize Text Files with NLTK
Topic Modeling
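As a rough illustration of the JSON-L output described above, the sketch below builds one record holding unigram, bigram, and trigram counts plus the full text. It uses only the Python standard library and a naive regular-expression tokenizer; the notebooks themselves may tokenize differently (for example with NLTK). The function names and the field names (`id`, `unigramCount`, `bigramCount`, `trigramCount`, `fullText`) are assumptions chosen for illustration, not the schema the notebooks necessarily use.

```python
import json
import re
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_record(doc_id, text):
    """Build one JSON-serializable record for a single document."""
    # Naive tokenizer (an assumption): lowercase the text, then keep
    # runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {
        "id": doc_id,  # hypothetical field names throughout
        "unigramCount": dict(Counter(tokens)),
        "bigramCount": dict(Counter(ngrams(tokens, 2))),
        "trigramCount": dict(Counter(ngrams(tokens, 3))),
        "fullText": text,
    }


record = to_record("doc-1", "The cat sat. The cat slept.")

# A JSON-L file stores one JSON object per line, so each document
# becomes a single line of output.
line = json.dumps(record)
print(line)
```

Writing one such line per document keeps the file streamable: a reader can process a large corpus record by record without loading it all into memory.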