77 changes: 75 additions & 2 deletions README.md
@@ -1,3 +1,76 @@
# Text-Analysis-Project
# Text-Analysis-Project of *A Perfect Gentleman*

# Part 4 Write Up and Reflection

## 1. Project Overview

For this project, I used text from Project Gutenberg, specifically the public-domain book *A Perfect Gentleman* by Ralph Hale Mottram. The primary techniques were text cleaning, stopword removal, word frequency analysis, and sentiment analysis with NLTK’s VADER tool. I also used AI to build a summarization feature. My goals were to process the raw text, extract meaningful insights, and analyze sentiment trends throughout the book. Additionally, I aimed to visualize the most frequent words and summarize the text to better understand its structure and tone. I hoped to learn about the different ways text data can be processed and to expand my Python knowledge.

---

## 2. Implementation

The system consists of several major components:

1. **Text Cleaning**:
The text is loaded from a local pickle file and cleaned to remove Project Gutenberg headers, punctuation, and irrelevant characters.

2. **Stopword Removal & Word Frequency**:
Stopwords are removed, and word frequency analysis is performed using Python’s `Counter` class to identify the most common words.

3. **Visualization**:
The top 20 most frequent words are displayed using a simple text-based bar chart in the console. Each word is listed alongside a series of `#` symbols proportional to its frequency, providing an easy-to-read representation of the most common words without requiring external plotting libraries.

4. **Sentiment Analysis (Optional NLP)**:
Using NLTK’s VADER, the text is tokenized into sentences and each sentence is scored for sentiment. An overall sentiment score is calculated by averaging all sentence-level compound scores.
A key design decision was whether to compute sentiment on individual sentences or the entire text. Sentence-level analysis was chosen because it provides more granular insights into fluctuations in tone, instead of averaging extremes across the whole book.

5. **AI-Powered Text Summarization (Optional NLP)**:
A transformer-based model (`distilbart-cnn-12-6`) is used to generate a concise summary of the text. For longer texts, the system splits the text into manageable chunks, summarizes each chunk, and combines the results into a final overview. This feature leverages AI to provide an intelligent summary that highlights the book’s main points and themes, complementing the quantitative analyses from word frequencies and sentiment scores. When the transformers library is unavailable, a keyword-based summarizer provides a fallback summary using frequency-weighted sentence selection.

AI tools, like ChatGPT, were instrumental throughout the project: they helped clarify NLTK setup, resolve tokenizer errors, guide efficient processing of large texts, and suggest the addition of an AI-powered summarizing feature that enhances the analytical workflow.

### AI Assistance Screenshots

**ChatGPT NLTK help:**
![ChatGPT NLTK Help](images/AI_image1.png)

**ChatGPT Error Fixed:**
![ChatGPT Error Fixed](images/AI_image2.png)

**ChatGPT: added summarizing feature from Part 3:**
![ChatGPT added summarizing feature](images/AI_image3.png)

---

## 3. Results

The text analysis of *A Perfect Gentleman* revealed several interesting patterns and insights. After cleaning the text and removing stopwords, we identified the most frequent words in the book. As expected, common nouns and character names appeared prominently, giving a sense of recurring people and interactions. The top 20 most frequent words are visualized in a text-based bar chart, which clearly highlights the key terms that dominate the narrative.

Sentiment analysis using NLTK’s VADER provided a granular view of the book's tone. By analyzing individual sentences, we observed fluctuations between positive, neutral, and negative sentiment throughout the chapters. This sentence-level approach revealed moments of tension or optimism that would have been averaged out if only an overall sentiment score had been calculated. For example, certain passages describing pivotal character interactions consistently showed higher positive sentiment, while scenes of conflict scored negatively.

**Example Sentiment Scores (Preview of 5 Sentences):**
| Sentence Preview | Sentiment Score (Compound) |
| ----------------------------------------------------- | -------------------------- |
| I feel six feet tall. My feet touch the wall. | 0.42 |
| And then I figured, I’ll have a drink… | 0.13 |
| It was an uncovered peril…                            | -0.34                      |
| I stayed with you all night. I forgot to pray. | 0.21 |
| And when I look into your eyes I see a season or two. | 0.58 |


Finally, the AI-powered summarization feature provided a concise overview of the book, highlighting major events, themes, and character interactions. By splitting the text into manageable chunks and summarizing each, the system generated a coherent summary that captures the essence of the narrative without manually reading the entire text. This demonstrates the value of AI tools in complementing traditional text analysis methods, allowing for both quantitative insights and high-level narrative understanding.

**Example AI-Generated Summary:**
"Moses was the founder of the nation and the type of the saviour talboys milton desires for his work all qualities of style as the variable subject required them . The author of the pentateuch required that he should be named but this in particular that moses was . the historian of the creation and fall north . The seventh book of the paradise lost opens with an invocation for aid and again to the same person we find in the opening verses the personality attributed with increased distinctness and with much increased boldness a proper name is given and a new imaginary person introduced . The poet redeems the boldness of adventurously transplanting from a pagan mythology into a christian poem ."

These results collectively showcase how combining classical NLP techniques with AI-powered summarization can provide both detailed quantitative analysis and human-readable insights into literary texts.

---

## 4. Reflection

The project went well in structuring a clear workflow for cleaning, word frequency analysis, visualization, and sentiment analysis. The biggest challenge was handling large text inputs for AI summarization, which was solved by splitting the text into smaller chunks. Overall, the project was appropriately scoped, and testing at each stage ensured reliable outputs.

My biggest learning takeaway was how classical NLP and AI models can complement each other to provide both quantitative and qualitative insights. ChatGPT helped troubleshoot NLTK issues, optimize processing, and implement summarization. I plan to apply these techniques to larger texts in future projects; knowing about transformer token limits earlier would have saved time during implementation.


Please read the [instructions](instructions.md).
Binary file added a_perfect_gentleman.pkl
Binary file not shown.
150 changes: 150 additions & 0 deletions analyze_text.py
@@ -0,0 +1,150 @@
"""
Part 2 Analyzing Your Text with Optional NLP + AI Summary
Author: Sophia Pak
"""

import re
import pickle
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

# Importing transformers; fallback if unavailable
try:
    from transformers import pipeline
    HAS_TRANSFORMERS = True
except ImportError:
    HAS_TRANSFORMERS = False
    print("'transformers' not found. Using lightweight built-in summarizer instead.")

# NLTK setup
nltk.download('stopwords')
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases require this for sent_tokenize


# 1. Text Cleaning and Preprocessing
def clean_text(text: str) -> str:
    """Clean and preprocess text."""
    start = re.search(r'\*\*\* START OF .* \*\*\*', text)
    end = re.search(r'\*\*\* END OF .* \*\*\*', text)
    if start and end:
        text = text[start.end():end.start()]
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# 2. Removing Stop Words
def remove_stopwords(text: str) -> list:
    stop_words = set(stopwords.words('english'))
    words = text.split()
    return [word for word in words if word not in stop_words]


# 3. Word Frequency Analysis
def get_word_frequencies(words: list) -> Counter:
    return Counter(words)


# 4. Summary Statistics
def summary_statistics(words: list, word_freq: Counter):
    total_words = len(words)
    unique_words = len(set(words))
    avg_word_length = sum(len(word) for word in words) / total_words
    top_20_words = word_freq.most_common(20)

    print("\nText Summary Statistics:")
    print(f"Total words: {total_words}")
    print(f"Unique words: {unique_words}")
    print(f"Average word length: {avg_word_length:.2f}")
    print("\nTop 20 most frequent words:")
    for word, freq in top_20_words:
        print(f"{word}: {freq}")
    return top_20_words


# 5. Text-Based Bar Chart
def visualize_bar_chart(top_words: list, scale: int = 2):
    print("\nWord Frequency Chart:")
    for word, freq in top_words:
        bar = "#" * (freq // scale)
        print(f"{word.ljust(15)} | {bar} ({freq})")


# 6. Sentiment Analysis
def analyze_sentiment(text: str, preview_sentences: int = 5, max_words: int = 20):
    sentences = sent_tokenize(text)
    sia = SentimentIntensityAnalyzer()
    compound_scores = [sia.polarity_scores(s)['compound'] for s in sentences]
    overall_score = sum(compound_scores) / len(compound_scores)

    print("\nExample sentence sentiment scores:")
    for s in sentences[:preview_sentences]:
        snippet = " ".join(s.split()[:max_words])
        print(f"→ {snippet}...")
        print(f"   Sentiment: {sia.polarity_scores(s)}\n")

    print(f"Overall compound sentiment score: {overall_score:.3f}")
    return overall_score


# 7. AI Text Summarization (with fallback)
def ai_summary(text: str, max_words: int = 300):
    print("\nGenerating AI summary...")

    if HAS_TRANSFORMERS:
        summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

        words = text.split()
        chunk_size = 900  # safe margin below the model's token limit
        chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

        summaries = []
        for i, chunk in enumerate(chunks[:5]):
            print(f"  Summarizing chunk {i + 1}/{min(len(chunks), 5)}...")
            try:
                summary = summarizer(chunk, max_length=150, min_length=50, do_sample=False)
                summaries.append(summary[0]['summary_text'])
            except Exception as e:
                print(f"Skipping chunk {i + 1}: {e}")

        combined_summary = " ".join(summaries)
        print("\nAI-Generated Summary:\n")
        print(combined_summary)
        print("-" * 80)
        return combined_summary

    else:
        sentences = re.split(r'(?<=[.!?]) +', text)
        words = re.findall(r'\w+', text.lower())
        freq = Counter(words)
        scores = {s: sum(freq[w] for w in re.findall(r'\w+', s.lower())) for s in sentences}
        top = sorted(scores, key=scores.get, reverse=True)[:5]

        print("\nKeyword-Based Summary (no transformers):\n")
        for s in top:
            print("-", s.strip())
        print("-" * 80)
        return " ".join(top)


# 8. Main Analysis Workflow
def main_analysis(filename="a_perfect_gentleman.pkl"):
    with open(filename, "rb") as f:
        text = pickle.load(f)

    clean = clean_text(text)
    words_filtered = remove_stopwords(clean)
    word_freq = get_word_frequencies(words_filtered)
    top_words = summary_statistics(words_filtered, word_freq)
    visualize_bar_chart(top_words)

    # Sentence tokenization needs punctuation, which clean_text strips out,
    # so strip only the Gutenberg header/footer before sentiment analysis
    # and summarization.
    start = re.search(r'\*\*\* START OF .* \*\*\*', text)
    end = re.search(r'\*\*\* END OF .* \*\*\*', text)
    body = text[start.end():end.start()] if start and end else text
    analyze_sentiment(body, preview_sentences=5, max_words=20)
    ai_summary(body)


if __name__ == "__main__":
    main_analysis()
53 changes: 53 additions & 0 deletions harvest_text.py
@@ -0,0 +1,53 @@
"""
Part 1 Harvesting Text from the Internet
Author: Sophia Pak
"""

import urllib.request
import os
import pickle


def download_gutenberg_text(url: str) -> str:
    """Download plain text from Project Gutenberg."""
    try:
        with urllib.request.urlopen(url) as f:
            text = f.read().decode("utf-8")
        print("Successfully downloaded text from Project Gutenberg.")
        return text
    except Exception as e:
        print("Error downloading Gutenberg text:", e)
        return ""


def save_text(text: str, filename: str):
    """Save text locally as a pickle file."""
    with open(filename, "wb") as f:
        pickle.dump(text, f)
    print(f"Text saved to {filename}")


def load_text(filename: str) -> str:
    """Load text from a pickle file."""
    with open(filename, "rb") as f:
        text = pickle.load(f)
    print(f"Loaded text from {filename}")
    return text


def main():
    # *A Perfect Gentleman* by Ralph Hale Mottram (public domain)
    gutenberg_url = "https://www.gutenberg.org/cache/epub/72949/pg72949.txt"
    filename = "a_perfect_gentleman.pkl"

    # Download and save if not already downloaded
    if not os.path.exists(filename):
        text = download_gutenberg_text(gutenberg_url)
        if text:
            save_text(text, filename)
    else:
        text = load_text(filename)


if __name__ == "__main__":
    main()
Binary file added images/AI_image1.png
Binary file added images/AI_image2.png
Binary file added images/AI_image3.png