OIM3640 · nidhiraju10 · Nov 8, 2025
diff --git a/README.md b/README.md
@@ -1,3 +1,43 @@
 # Text-Analysis-Project
+Project Title: Text Analysis Project – Alice in Wonderland and Frankenstein
 
-Please read the [instructions](instructions.md).
+# Project Overview
+
+For this project, I used two books from Project Gutenberg: Alice’s Adventures in Wonderland by Lewis Carroll and Frankenstein by Mary Shelley. The goal was to explore how Python can be used to analyze, compare, and visualize text data. I applied techniques such as text cleaning, stopword removal, word frequency analysis, and summary statistics. Through this, I aimed to learn how language, tone, and theme differ between these two distinct genres, and to explore deeper qualitative insights such as emotional tone.
+
+# Implementation
+
+The system is built with Python and uses several libraries for different analysis techniques:
+
+Text Cleaning: Unnecessary characters, punctuation, stopwords and headers were removed, and text was converted to lowercase.
+
+Word Frequency: The remaining words were counted with Python's collections.Counter to identify the most common words and their frequencies, and an ASCII bar chart visualized the top 20 words.
+
+Sentiment Analysis: NLTK’s SentimentIntensityAnalyzer determined emotional tone per text.
+
+Cosine Similarity: Scikit-learn’s TfidfVectorizer converted each book into a TF-IDF vector, and the cosine similarity between the two vectors measured how much their vocabulary and themes overlap.
+
+Design Decision: Instead of heavy plotting libraries, I used an ASCII bar chart for visualization.
+
+GenAI (ChatGPT) helped guide me in optimizing the code.
+
+# Results
+The project achieved the following results:
+
+Word Frequency:
+Alice in Wonderland – Common words included said, Alice, little, Queen, and thought, reflecting a story driven by character dialogue and whimsical interactions.
+
+Frankenstein – Frequent words such as life, father, eyes, shall, and man indicate a more reflective and emotional tone centered on human experience and morality.
+
+Cosine Similarity:
+The similarity score between the two texts was 0.25, indicating limited overlap in vocabulary and subject matter, which makes sense because the books belong to very different genres.
+
+Sentiment Analysis:
+Alice in Wonderland had generally neutral to positive sentiment, whereas Frankenstein showed more negative or somber sentiment, with words suggesting conflict, guilt, or emotional struggle.
+
+Visualization:
+The ASCII bar chart clearly highlighted the differences: Alice is dominated by character dialogue, while Frankenstein emphasizes abstract and emotional words.
+![alt text](image.png)
+
+# Reflection
+This project was both challenging and insightful. From a learning perspective, I realized the versatility of text analysis in understanding themes, sentiment, and content generation. Alice in Wonderland used simple, lively language with lots of dialogue, while Frankenstein had a heavier tone and more emotional depth. The low similarity score showed how different their vocabularies really are. Cleaning the text, removing stopwords, and looking at word frequencies made me see how much detail is hidden in plain text. I also learned how sentiment analysis can capture the overall mood of a story without needing to read every line.
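
Review note on the similarity score: the README reports a cosine similarity of 0.25 between the two books. As a rough illustration of what that number measures, here is a minimal, dependency-free sketch. It is an assumption-laden simplification: it uses raw word counts rather than the TF-IDF weights that `tfidf_similarity` in `main.py` computes, and the helper name `cosine_from_counts` and the two toy sentences are hypothetical, not part of the project.

```python
import math
from collections import Counter


def cosine_from_counts(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors.

    Hypothetical helper for illustration; main.py uses TF-IDF weights instead.
    """
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


# Toy documents (made up for this sketch, not taken from the books).
doc1 = Counter("alice said the queen said".split())
doc2 = Counter("the monster said nothing".split())
score = cosine_from_counts(doc1, doc2)
print(f"{score:.3f}")
```

The score is 1.0 for documents with identical word proportions and 0.0 for documents with no words in common, so a value near 0.25 sits toward the low-overlap end of that range.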
diff --git a/image.png b/image.png
diff --git a/main.py b/main.py
@@ -0,0 +1,130 @@
+import json
+import os
+import re
+import urllib.request
+from collections import Counter
+from typing import List, Tuple, Dict
+
+import nltk
+from nltk.sentiment import SentimentIntensityAnalyzer
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.metrics.pairwise import cosine_similarity
+
+nltk.download('vader_lexicon', quiet=True)  # fetch the VADER lexicon; quiet suppresses download logs
+
+STOPWORDS = {
+    "a","an","the","and","or","but","if","then","else","when","while","of","to","in","on","for","from","by",
+    "with","as","at","is","are","was","were","be","been","being","that","this","it","its","they","them","their",
+    "she","her","he","his","you","your","i","we","us","our","not","no","do","does","did","so","such","than","too",
+    "very","can","could","should","would","may","might","will","just","my","me","him","her","his","hers","ours","theirs","our","your","yours",
+    "which","who","whom","whose","what","this","that","these","those","am","is","are","was","were","be","been","being","have","has","had","do","does","did"
+}
+
+def fetch_text(url: str) -> str:
+    req = urllib.request.Request(url, headers={"User-Agent": "TextAnalysisProject/1.0"})
+    with urllib.request.urlopen(req, timeout=40) as f:
+        return f.read().decode("utf-8", errors="ignore")
+
+def strip_gutenberg(text: str) -> str:
+    lines = text.splitlines()
+    start, end = 0, len(lines)
+    for i, line in enumerate(lines):
+        if "start of the project gutenberg" in line.lower():
+            start = i + 1
+            break
+    for j in range(len(lines)-1, -1, -1):
+        if "end of the project gutenberg" in lines[j].lower():
+            end = j
+            break
+    return "\n".join(lines[start:end])
+
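+def clean_text(text: str) -> str:
+    text = re.sub(r"\[[^\]]*\]", " ", text)        # drop bracketed notes such as [Illustration]
+    text = re.sub(r"[^A-Za-z\s'\-]", " ", text)    # keep only ASCII letters, whitespace, apostrophes, hyphens
+    text = re.sub(r"\s+", " ", text)
+    text = text.lower().strip()
+    text = re.sub(r"\b[a-zA-Z]\b", " ", text)      # remove standalone single letters
+    text = re.sub(r"\s+", " ", text).strip()       # collapse the gaps left behind
+    return text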
+
+
+def tokenize(text: str) -> List[str]:
+    return [t for t in re.split(r"\s+", text) if t]
+
+def remove_stopwords(tokens: List[str]) -> List[str]:
+    return [t for t in tokens if t not in STOPWORDS]
+
+def word_frequencies(tokens: List[str]) -> Counter:
+    return Counter(tokens)
+
+def ascii_bar_chart(pairs: List[Tuple[str, int]], width: int = 50) -> str:
+    maxcount = max((c for _, c in pairs), default=0) or 1  # default avoids ValueError on empty input
+    return "\n".join(f"{w:>15} | {'#' * int((c / maxcount) * width)} {c}" for w, c in pairs)
+
+def summary_stats(tokens: List[str]) -> Dict[str, float]:
+    if not tokens:
+        return {"num_tokens": 0, "vocab_size": 0, "avg_word_len": 0.0}
+    lengths = [len(t) for t in tokens]
+    return {
+        "num_tokens": len(tokens),
+        "vocab_size": len(set(tokens)),
+        "avg_word_len": sum(lengths) / len(lengths),
+    }
+
+def sentiment_sample(text: str, take: int = 10):
+    sia = SentimentIntensityAnalyzer()
+    sentences = re.split(r'(?<=[.!?])\s+', text)
+    return [(s.strip(), sia.polarity_scores(s.strip())["compound"]) for s in sentences[:take] if s.strip()]
+
+def tfidf_similarity(text1: str, text2: str) -> float:
+    vec = TfidfVectorizer(
+        stop_words='english',
+        token_pattern=r'(?u)\b[a-zA-Z]{2,}\b',
+        min_df=1,
+        max_df=1.0
+    )
+    X = vec.fit_transform([text1, text2])
+    S = cosine_similarity(X)
+    return float(S[0, 1])
+
+def main():
+    url = "https://www.gutenberg.org/cache/epub/11/pg11.txt"  # Alice in Wonderland
+    compare_url = "https://www.gutenberg.org/cache/epub/84/pg84.txt"  # Frankenstein
+
+    raw = fetch_text(url)
+    text = clean_text(strip_gutenberg(raw))
+    tokens = remove_stopwords(tokenize(text))
+    freqs = word_frequencies(tokens)
+    top_words = freqs.most_common(20)
+    stats = summary_stats(tokens)
+
+    print("\n=== Top Words ===")
+    print(ascii_bar_chart(top_words))
+    print("\n=== Summary Stats ===")
+    for k, v in stats.items():
+        print(f"{k}: {v}")
+
+    comp_text = clean_text(strip_gutenberg(fetch_text(compare_url)))
+    comp_tokens = remove_stopwords(tokenize(comp_text))
+    comp_freqs = word_frequencies(comp_tokens)
+    comp_top = comp_freqs.most_common(20)
+
+    print("\n=== Top Words (Comparison Book) ===")
+    print(ascii_bar_chart(comp_top))
+
+    similarity = tfidf_similarity(text, comp_text)
+    print(f"\nCosine Similarity with comparison text: {similarity:.3f}")
+
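+    # Score sentences from the uncleaned text: clean_text() removes the . ! ? marks the splitter needs.
+    sentiment = sentiment_sample(strip_gutenberg(raw))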
+
+    os.makedirs("data", exist_ok=True)
+    with open("data/top_words.txt", "w", encoding="utf-8") as f:
+        f.write(ascii_bar_chart(top_words))
+    with open("data/summary.json", "w", encoding="utf-8") as f:
+        json.dump({"top_words": top_words, "stats": stats, "similarity": similarity, "sentiment": sentiment}, f, indent=2)
+
+if __name__ == "__main__":
+    main()
+
+
+# AI Usage:
+# I used ChatGPT to help me structure the functions for text cleaning and frequency analysis.
+# I provided prompts describing the desired functionality, and ChatGPT generated code snippets which I then reviewed, tested, and modified as needed to fit the overall program. 
+# I also used ChatGPT to help debug some syntax errors and optimize certain parts of the code. However, I ensured that I understood all the code and made final decisions on implementing the code.
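
Review note on sentence splitting: the splitter in `sentiment_sample` looks for `.`, `!`, or `?` followed by whitespace, while `clean_text` replaces every character that is not a letter, whitespace, apostrophe, or hyphen. The sketch below shows how the two interact, using the same two regular expressions on a made-up sample string. The names `split_sentences`, `SENTENCE_SPLIT`, and `NON_LETTERS` are hypothetical, and the cleaning step is a simplified stand-in for `clean_text`, not a copy of it.

```python
import re
from typing import List

SENTENCE_SPLIT = r"(?<=[.!?])\s+"   # same pattern as in sentiment_sample
NON_LETTERS = r"[^A-Za-z\s'\-]"     # same character class as in clean_text


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(SENTENCE_SPLIT, text) if s.strip()]


# Made-up sample text, not taken from either book.
raw = "Alice was tired. Was she dreaming? Oh dear!"

# Simplified stand-in for clean_text: strip punctuation, collapse spaces, lowercase.
cleaned = re.sub(r"\s+", " ", re.sub(NON_LETTERS, " ", raw)).lower().strip()

from_raw = split_sentences(raw)
from_cleaned = split_sentences(cleaned)

print(len(from_raw), from_raw)
print(len(from_cleaned), from_cleaned)
```

On the sample, the raw string yields three sentences and the cleaned string yields a single one, because no sentence-ending punctuation is left to split on. Sentence-level sentiment scores are therefore only meaningful when the splitter receives text that still contains its punctuation.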