diff --git a/README.md b/README.md
index 05aa109..a9a5933 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,76 @@
-# Text-Analysis-Project
+# Text-Analysis-Project of *A Perfect Gentleman*
+
+# Part 4 Write-Up and Reflection
+
+## 1. Project Overview
+
+For this project, I used text from Project Gutenberg, specifically the public domain book *A Perfect Gentleman* by Ralph Hale Mottram. The primary techniques employed included text cleaning, stopword removal, word frequency analysis, and sentiment analysis using NLTK’s VADER tool. I also used AI to create a summarizing feature. My goals were to process the raw text, extract meaningful insights, and analyze the sentiment trends throughout the book. Additionally, I aimed to visualize the most frequent words and summarize the text to better understand its structure and tone. I hoped to learn about the different ways text data can be processed and to expand my Python knowledge.
+
+---
+
+## 2. Implementation
+
+The system consists of several major components:
+
+1. **Text Cleaning**:
+   The text is loaded from a local pickle file and cleaned to remove Project Gutenberg headers, punctuation, and irrelevant characters.
+
+2. **Stopword Removal & Word Frequency**:
+   Stopwords are removed, and word frequency analysis is performed using Python’s `Counter` class to identify the most common words.
+
+3. **Visualization**:
+   The top 20 most frequent words are displayed as a simple text-based bar chart in the console. Each word is listed alongside a run of `#` symbols proportional to its frequency, providing an easy-to-read view of the most common words without requiring external plotting libraries.
+
+4. **Sentiment Analysis (Optional NLP)**:
+   Using NLTK’s VADER, the text is tokenized into sentences and each sentence is scored for sentiment. An overall sentiment score is calculated by averaging all sentence-level compound scores.
+   A key design decision was whether to compute sentiment on individual sentences or on the entire text. Sentence-level analysis was chosen because it provides more granular insight into fluctuations in tone, rather than averaging out extremes across the whole book.
+
+5. **AI-Powered Text Summarization (Optional NLP)**:
+   A transformer-based model (`distilbart-cnn-12-6`) is used to generate a concise summary of the text. For longer texts, the system splits the text into manageable chunks, summarizes each chunk, and combines the results into a final overview. This feature leverages AI to provide a summary that highlights the book’s main points and themes, complementing the quantitative analyses from word frequencies and sentiment scores. When the transformers library is unavailable, a keyword-based summarizer provides a fallback using frequency-weighted sentence selection.
+
+AI tools, like ChatGPT, were instrumental throughout the project: they helped clarify NLTK setup, resolve tokenizer errors, guide efficient processing of large texts, and suggest the AI-powered summarizing feature that rounds out the analytical workflow.
+
+### AI Assistance Screenshots
+
+**ChatGPT NLTK help:**
+![ChatGPT NLTK Help](images/AI_image1.png)
+
+**ChatGPT Error Fixed:**
+![ChatGPT Error Fixed](images/AI_image2.png)
+
+**ChatGPT: added summarizing feature from Part 3:**
+![ChatGPT added summarizing feature](images/AI_image3.png)
+
+---
+
+## 3. Results
+
+The text analysis of *A Perfect Gentleman* revealed several interesting patterns. After cleaning the text and removing stopwords, I identified the most frequent words in the book. As expected, common nouns and character names appeared prominently, giving a sense of the recurring people and interactions. The top 20 most frequent words are visualized in a bar chart, which clearly highlights the terms that dominate the narrative.
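The frequency counting and `#`-bar visualization described above boil down to a few lines of standard-library Python. A minimal sketch, using illustrative stand-in words rather than actual tokens from the book:

```python
from collections import Counter

# Stand-in tokens; in the project these come from the cleaned,
# stopword-filtered book text.
words = ["gentleman", "street", "gentleman", "house", "street", "gentleman"]

freq = Counter(words)
for word, count in freq.most_common(3):
    # One '#' per occurrence gives a quick console bar chart.
    print(f"{word.ljust(12)} | {'#' * count} ({count})")
```

`Counter.most_common(n)` returns the `n` highest-frequency `(word, count)` pairs in descending order, so here `gentleman` (3) prints first, then `street` (2), then `house` (1).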
+
+Sentiment analysis using NLTK’s VADER provided a granular view of the book’s tone. By analyzing individual sentences, I observed fluctuations between positive, neutral, and negative sentiment throughout the chapters. This sentence-level approach revealed moments of tension or optimism that would have been averaged out if only an overall sentiment score had been calculated. For example, certain passages describing pivotal character interactions consistently showed higher positive sentiment, while scenes of conflict scored negatively.
+
+**Example Sentiment Scores (Preview of 5 Sentences):**
+
+| Sentence Preview                                      | Sentiment Score (Compound) |
+| ----------------------------------------------------- | -------------------------- |
+| I feel six feet tall. My feet touch the wall.         | 0.42                       |
+| And then I figured, I’ll have a drink…                | 0.13                       |
+| It was an uncovered peril…                            | -0.34                      |
+| I stayed with you all night. I forgot to pray.        | 0.21                       |
+| And when I look into your eyes I see a season or two. | 0.58                       |
+
+Finally, the AI-powered summarization feature provided a concise overview of the book, highlighting major events, themes, and character interactions. By splitting the text into manageable chunks and summarizing each one, the system generated a coherent summary that captures the essence of the narrative without the need to read the entire book manually. This demonstrates the value of AI tools in complementing traditional text analysis, allowing for both quantitative insights and high-level narrative understanding.
+
+**Example AI-Generated Summary:**
+
+"Moses was the founder of the nation and the type of the saviour talboys milton desires for his work all qualities of style as the variable subject required them . The author of the pentateuch required that he should be named but this in particular that moses was . the historian of the creation and fall north .
+The seventh book of the paradise lost opens with an invocation for aid and again to the same person we find in the opening verses the personality attributed with increased distinctness and with much increased boldness a proper name is given and a new imaginary person introduced . The poet redeems the boldness of adventurously transplanting from a pagan mythology into a christian poem ."
+
+These results collectively showcase how combining classical NLP techniques with AI-powered summarization can provide both detailed quantitative analysis and human-readable insights into literary texts.
+
+---
+
+## 4. Reflection
+
+The project went well: the workflow for cleaning, word frequency analysis, visualization, and sentiment analysis was clearly structured. The biggest challenge was handling large text inputs for AI summarization, which was solved by splitting the text into smaller chunks. Overall, the project was appropriately scoped, and testing at each stage ensured reliable outputs.
+
+My biggest learning takeaway was how classical NLP and AI models can complement each other to provide both quantitative and qualitative insights. ChatGPT helped troubleshoot NLTK issues, optimize processing, and implement summarization. I plan to apply these techniques to larger texts in future projects; knowing about transformer token limits earlier would have saved time during implementation.
+
-Please read the [instructions](instructions.md).
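The chunk-splitting strategy mentioned in the reflection (keeping each summarization request under the transformer's token limit) works independently of the transformers library. A minimal sketch; the 900-word chunk size mirrors the margin used in `analyze_text.py`:

```python
def chunk_words(text: str, chunk_size: int = 900) -> list:
    """Split text into chunks of at most `chunk_size` whitespace-delimited words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# 2000 dummy words split into chunks of 900, 900, and 200 words.
sample = " ".join(f"word{i}" for i in range(2000))
chunks = chunk_words(sample)
print(len(chunks))  # → 3
```

Each chunk is then summarized separately and the partial summaries are concatenated, which is why the combined summary reads as a sequence of independent passages.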
diff --git a/a_perfect_gentleman.pkl b/a_perfect_gentleman.pkl
new file mode 100644
index 0000000..80e8d5e
Binary files /dev/null and b/a_perfect_gentleman.pkl differ
diff --git a/analyze_text.py b/analyze_text.py
new file mode 100644
index 0000000..357de8f
--- /dev/null
+++ b/analyze_text.py
@@ -0,0 +1,150 @@
+"""
+Part 2 Analyzing Your Text with Optional NLP + AI Summary
+Author: Sophia Pak
+"""
+
+import re
+import pickle
+from collections import Counter
+import nltk
+from nltk.corpus import stopwords
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+from nltk.tokenize import sent_tokenize
+
+# Import transformers; fall back to a lightweight summarizer if unavailable.
+try:
+    from transformers import pipeline
+    HAS_TRANSFORMERS = True
+except ImportError:
+    HAS_TRANSFORMERS = False
+    print("'transformers' not found. Using lightweight built-in summarizer instead.")
+
+# NLTK setup
+nltk.download('stopwords')
+nltk.download('vader_lexicon')
+nltk.download('punkt')
+
+
+# 1. Text Cleaning and Preprocessing
+def clean_text(text: str) -> str:
+    """Strip the Gutenberg boilerplate, lowercase, and remove punctuation."""
+    start = re.search(r'\*\*\* START OF .* \*\*\*', text)
+    end = re.search(r'\*\*\* END OF .* \*\*\*', text)
+    if start and end:
+        text = text[start.end():end.start()]
+    text = text.lower()
+    text = re.sub(r'[^a-z\s]', '', text)
+    text = re.sub(r'\s+', ' ', text).strip()
+    return text
+
+
+# 2. Removing Stop Words
+def remove_stopwords(text: str) -> list:
+    """Return the words of the text with English stopwords filtered out."""
+    stop_words = set(stopwords.words('english'))
+    words = text.split()
+    return [word for word in words if word not in stop_words]
+
+
+# 3. Word Frequency Analysis
+def get_word_frequencies(words: list) -> Counter:
+    """Count how often each word occurs."""
+    return Counter(words)
+
+
+# 4. Summary Statistics
+def summary_statistics(words: list, word_freq: Counter):
+    """Print total/unique word counts, average word length, and the top 20 words."""
+    total_words = len(words)
+    unique_words = len(set(words))
+    avg_word_length = sum(len(word) for word in words) / total_words
+    top_20_words = word_freq.most_common(20)
+
+    print("\nText Summary Statistics:")
+    print(f"Total words: {total_words}")
+    print(f"Unique words: {unique_words}")
+    print(f"Average word length: {avg_word_length:.2f}")
+    print("\nTop 20 most frequent words:")
+    for word, freq in top_20_words:
+        print(f"{word}: {freq}")
+    return top_20_words
+
+
+# 5. Text-Based Bar Chart
+def visualize_bar_chart(top_words: list, scale: int = 2):
+    """Render a console bar chart with one '#' per `scale` occurrences."""
+    print("\nWord Frequency Chart:")
+    for word, freq in top_words:
+        bar = "#" * (freq // scale)
+        print(f"{word.ljust(15)} | {bar} ({freq})")
+
+
+# 6. Sentiment Analysis
+def analyze_sentiment(text: str, preview_sentences: int = 5, max_words: int = 20):
+    """Score each sentence with VADER and report the average compound score."""
+    sentences = sent_tokenize(text)
+    sia = SentimentIntensityAnalyzer()
+    compound_scores = [sia.polarity_scores(s)['compound'] for s in sentences]
+    overall_score = sum(compound_scores) / len(compound_scores)
+
+    print("\nExample sentence sentiment scores:")
+    for s in sentences[:preview_sentences]:
+        snippet = " ".join(s.split()[:max_words])
+        print(f"→ {snippet}...")
+        print(f"   Sentiment: {sia.polarity_scores(s)}\n")
+
+    print(f"Overall compound sentiment score: {overall_score:.3f}")
+    return overall_score
+
+
+# 7. AI Text Summarization (with fallback)
+def ai_summary(text: str):
+    print("\nGenerating AI summary...")
+
+    if HAS_TRANSFORMERS:
+        summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
+
+        words = text.split()
+        chunk_size = 900  # safe margin below the model's token limit
+        chunks = [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
+
+        summaries = []
+        for i, chunk in enumerate(chunks[:5]):
+            print(f"  Summarizing chunk {i + 1}/{min(len(chunks), 5)}...")
+            try:
+                summary = summarizer(chunk, max_length=150, min_length=50, do_sample=False)
+                summaries.append(summary[0]['summary_text'])
+            except Exception as e:
+                print(f"Skipping chunk {i + 1}: {e}")
+
+        combined_summary = " ".join(summaries)
+        print("\nAI-Generated Summary:\n")
+        print(combined_summary)
+        print("-" * 80)
+        return combined_summary
+
+    else:
+        # Fallback: score each sentence by its summed word frequency, keep the top 5.
+        sentences = re.split(r'(?<=[.!?]) +', text)
+        words = re.findall(r'\w+', text.lower())
+        freq = Counter(words)
+        scores = {s: sum(freq[w] for w in re.findall(r'\w+', s.lower())) for s in sentences}
+        top = sorted(scores, key=scores.get, reverse=True)[:5]
+
+        print("\nKeyword-Based Summary (no transformers):\n")
+        for s in top:
+            print("-", s.strip())
+        print("-" * 80)
+        return " ".join(top)
+
+
+# 8. Main Analysis Workflow
+def main_analysis(filename="a_perfect_gentleman.pkl"):
+    with open(filename, "rb") as f:
+        text = pickle.load(f)
+
+    clean = clean_text(text)
+    words_filtered = remove_stopwords(clean)
+    word_freq = get_word_frequencies(words_filtered)
+    top_words = summary_statistics(words_filtered, word_freq)
+    visualize_bar_chart(top_words)
+
+    # clean_text() strips the punctuation that sent_tokenize relies on, so build
+    # a version that only drops the Gutenberg boilerplate for the sentence-level steps.
+    start = re.search(r'\*\*\* START OF .* \*\*\*', text)
+    end = re.search(r'\*\*\* END OF .* \*\*\*', text)
+    prose = text[start.end():end.start()] if start and end else text
+    analyze_sentiment(prose, preview_sentences=5, max_words=20)
+    ai_summary(prose)
+
+
+if __name__ == "__main__":
+    main_analysis()
diff --git a/harvest_text.py b/harvest_text.py
new file mode 100644
index 0000000..d7b1a34
--- /dev/null
+++ b/harvest_text.py
@@ -0,0 +1,53 @@
+"""
+Part 1 Harvesting Text from the Internet
+Author: Sophia Pak
+"""
+
+import urllib.request
+import os
+import pickle
+
+
+def download_gutenberg_text(url: str) -> str:
+    """Download plain text from Project Gutenberg."""
+    try:
+        with urllib.request.urlopen(url) as f:
+            text = f.read().decode("utf-8")
+        print("Successfully downloaded text from Project Gutenberg.")
+        return text
+    except Exception as e:
+        print("Error downloading Gutenberg text:", e)
+        return ""
+
+
+def save_text(text: str, filename: str):
+    """Save text locally as a pickle file."""
+    with open(filename, "wb") as f:
+        pickle.dump(text, f)
+    print(f"Text saved to {filename}")
+
+
+def load_text(filename: str) -> str:
+    """Load text from a pickle file."""
+    with open(filename, "rb") as f:
+        text = pickle.load(f)
+    print(f"Loaded text from {filename}")
+    return text
+
+
+def main():
+    # *A Perfect Gentleman* by Ralph Hale Mottram (public domain)
+    gutenberg_url = "https://www.gutenberg.org/cache/epub/72949/pg72949.txt"
+    filename = "a_perfect_gentleman.pkl"
+
+    # Download and save only if not already downloaded
+    if not os.path.exists(filename):
+        text = download_gutenberg_text(gutenberg_url)
+        if text:
+            save_text(text, filename)
+    else:
+        text = load_text(filename)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/images/AI_image1.png b/images/AI_image1.png
new file mode 100644
index 0000000..f6140fb
Binary files /dev/null and b/images/AI_image1.png differ
diff --git a/images/AI_image2.png b/images/AI_image2.png
new file mode 100644
index 0000000..0affd73
Binary files /dev/null and b/images/AI_image2.png differ
diff --git a/images/AI_image3.png b/images/AI_image3.png
new file mode 100644
index 0000000..40e42a6
Binary files /dev/null and b/images/AI_image3.png differ