44 changes: 44 additions & 0 deletions README.md
# Text-Analysis-Project

Please read the [instructions](instructions.md).

# Part 4: Project Writeup and Reflection

## 1. Project Overview
For my text analysis project, I used **Jane Eyre** by Charlotte Brontë from **Project Gutenberg** as the main dataset. I explored the text with several computational techniques, including text cleaning, stop-word removal, word-frequency analysis, ASCII visualization, Markov analysis to generate new text from learned word transitions, and sentiment analysis to gauge the emotional tone of the text.
The purpose of this project was to learn how to handle and analyze large text data step by step using fundamental Python logic from Think Python. I wanted to see how coding techniques like loops, conditionals, and dictionaries could reveal patterns in word use and emotional tone in literature. Through this process, I hoped to gain a deeper understanding of both text analysis and the emotional structure of Jane Eyre using only my own code and logic.

---

## 2. Implementation
The project was divided into several parts that work together.
First, the text file (jane_eyre.txt) was downloaded directly from the site and saved locally. I used Python's built-in open() function with UTF-8 encoding to read the file line by line.
Next comes the cleaning step: the `read_data()` and `clean_text()` functions prepare the text by reading lines, removing control characters and punctuation, and converting everything to lowercase. The cleaned text is then tokenized into words, and stop words are removed using a manually built stop-word list and simple list loops, following what we learned in class and in Think Python.
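As a hedged sketch of the stop-word step (the function name and the tiny stop list here are illustrative, not the project's actual ones):

```python
# Minimal sketch of stop-word removal with a manual list and a simple loop.
# STOP_WORDS is a toy illustrative set, not the project's real list.
STOP_WORDS = {"the", "and", "a", "of", "to", "in", "i", "it", "was"}

def remove_stop_words(words, stop_words):
    """Keep only the words that do not appear in the stop-word list."""
    kept = []
    for w in words:  # plain loop, Think Python style
        if w not in stop_words:
            kept.append(w)
    return kept

tokens = "it was the best of times it was the worst of times".split()
print(remove_stop_words(tokens, STOP_WORDS))
# ['best', 'times', 'worst', 'times']
```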

The next part builds the analysis. I first create a **word-frequency dictionary** that counts how often each word appears and displays the top results as an **ASCII bar chart**. The frequency data is used to calculate summary statistics such as total words, unique words, and vocabulary richness. For the advanced analysis, I chose **Markov text synthesis** with a bigram model: build a mapping from each prefix (a tuple of two words) to the list of possible next words, then generate text by repeatedly sampling a next word and shifting the prefix window. For the optional exploration, I implemented a **sentiment analysis** module using a small manually defined sentiment lexicon (`SENT_LEX`) and a **rolling mean function** to observe trends in emotional tone throughout the novel.
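The bigram Markov step can be sketched like this (a minimal illustration of the build-then-sample scheme described above; function names and the toy sentence are mine, not the project's actual code):

```python
import random

def build_bigram_map(words):
    """Map each two-word prefix (tuple) to its list of observed next words."""
    mapping = {}
    for i in range(len(words) - 2):
        prefix = (words[i], words[i + 1])
        mapping.setdefault(prefix, []).append(words[i + 2])
    return mapping

def generate(mapping, length=10, seed=None):
    """Generate text by sampling a next word and shifting the prefix window."""
    rng = random.Random(seed)
    prefix = rng.choice(list(mapping))  # random starting prefix
    out = list(prefix)
    for _ in range(length - 2):
        options = mapping.get(prefix)
        if not options:                 # dead end: stop early
            break
        nxt = rng.choice(options)
        out.append(nxt)
        prefix = (prefix[1], nxt)       # slide the two-word window
    return " ".join(out)

words = "said jane to mr rochester said jane to the reader".split()
model = build_bigram_map(words)
print(generate(model, length=8, seed=1))
```

Because the prefix is two words, repeated pairs like "said jane" accumulate several possible successors, which is what makes the sampled output vary.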

A key design decision was choosing between using external libraries like `nltk` or writing my own functions with only built-in tools. I decided to code all parts manually (loops, regex, dictionaries) to deepen my understanding of the logic behind text analysis instead of relying on pre-built sentiment models.

Throughout the project, I used ChatGPT to clarify Python logic, debug problems, and learn how to write some of the code. I pasted errors into ChatGPT to ask why my functions didn't print results, how to normalize ASCII chart lengths, and how to handle negations in sentiment analysis. The chats helped me break large problems into smaller steps and understand each part before testing (see more in the Part 3 AI write-up).
- https://chatgpt.com/share/690e2452-6088-8013-81c5-a30cefefaab2
- https://chatgpt.com/share/690cf8a6-2e18-8013-a696-76f9fb747bf6


---

## 3. Results
For the word summary statistics, after cleaning and processing the data I found 12,635 unique words, giving a vocabulary richness of 0.1378, meaning roughly one in seven words in the text is unique. The average word length was approximately 5.77 characters, which reflects the novel's formal Victorian writing style and tendency toward longer words.
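Those statistics follow from simple counts over the cleaned token list; a minimal sketch of the computation with a toy list (not the novel's data, and the function name is illustrative):

```python
def summary_stats(words):
    """Total words, unique words, vocabulary richness, average word length."""
    total = len(words)
    unique = len(set(words))
    richness = unique / total                      # unique / total words
    avg_len = sum(len(w) for w in words) / total   # mean characters per word
    return total, unique, richness, avg_len

words = ["reader", "i", "married", "him", "reader"]
total, unique, richness, avg_len = summary_stats(words)
print(total, unique, round(richness, 2), round(avg_len, 2))
# 5 4 0.8 4.6
```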

In the frequency analysis, the most common words were "now, will, one, said, s, mr, like, rochester, jane, well". This makes sense: "said," "Mr.," "Jane," and "Rochester" reflect frequent dialogue and direct address between the novel's two main characters, while words like "now," "will," and "well" point to reflection and emotion, which aligns with the novel's introspective tone. The **ASCII word-frequency bar chart** displayed the top 20 most common words in the novel, showing that emotional words appeared repeatedly.
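A hedged sketch of how such a chart can be built from a plain counting dictionary (toy counts and illustrative names, not the project's actual code):

```python
def word_counts(words):
    """Count occurrences with a plain dictionary, Think Python style."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

def ascii_bar_chart(counts, width=20):
    """One line per word; bars scaled so the largest count fills `width`."""
    top = max(counts.values())
    lines = []
    for word, n in sorted(counts.items(), key=lambda p: p[1], reverse=True):
        bar = "#" * max(1, round(n / top * width))
        lines.append(f"{word:<10} {bar} {n}")
    return lines

counts = word_counts("said jane said mr rochester said jane".split())
for line in ascii_bar_chart(counts):
    print(line)
```

Scaling every bar against the maximum count keeps the chart a fixed width no matter how large the raw frequencies get, which was one of the normalization questions mentioned above.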
The **sentiment analysis** produced a **total sentiment score of 1353** with an **average of 0.122 per sentence**, meaning the novel has a slightly positive tone overall. I learned that the relative sentiment distribution captured the general balance between emotionally charged and neutral passages. Sentences expressing affection, admiration, or gratitude tended to receive the highest scores, for instance "My Edward and I, then, are happy.", while sentences involving conflict or despair scored lowest, like "A cankering evil sat at my heart and drained my happiness."
In addition, using the bigram-based Markov model, I generated new sentences that mimicked Jane Eyre's style by predicting each word from its preceding pair. The output captured common transitions such as "Mr. Rochester" and "said Jane," reflecting the novel's dialogue-heavy, emotional tone. This shows that Markov analysis can reproduce the surface rhythm and vocabulary patterns of a literary text without requiring deep linguistic understanding.

## 4. Reflection
From a process point of view, one of the biggest challenges was figuring out how to get started and designing each function from scratch. In the beginning, I found it hard to understand how to clean text data properly and connect all the functions together, from reading and cleaning the text to building dictionaries and calculating word frequencies. To overcome this, I brainstormed each function's role with the help of ChatGPT, breaking the large project into smaller parts such as read_data(), clean_text(), stopwords(), and dictionary(). For example, I asked ChatGPT to break each task down into small steps.

Another challenge was debugging. Sometimes my output looked empty, or the ASCII bar chart didn't display correctly. By testing each function step by step and checking small text samples, I learned how to verify data at every stage before combining everything into the full pipeline. The sentiment analysis part was also complex at first; I had to learn how negations worked. When something was unclear, I asked the AI to explain the code in simple words, and if it wrote something too complex, I asked it to stick to concepts from Think Python.

What went well was that once I had the core data cleaning and counting logic working, I could expand easily; the new modules use the same logic, built on for loops with some tweaks. I then added Markov analysis and sentiment scoring as new modules, asking ChatGPT to explain those two advanced concepts from the start and to walk me through each piece of code. If I could improve one thing, I would expand the sentiment analysis by adding a more detailed lexicon and visualizing emotional trends across chapters to make the results clearer. I would also refine my Markov model to generate longer, more coherent sentences that better reflect the novel's tone.

From a learning perspective, this project helped me connect everything I've learned from Think Python into a real applied task, using loops, conditionals, dictionaries, and string operations to analyze a full-length novel. My biggest takeaway is to actually practice the code we learned in class on a real problem. Another key takeaway was seeing how the basic functions I wrote for cleaning, counting, and filtering words could later be reused to build more advanced analyses like sentiment scoring and Markov text generation. The project was appropriately scoped because it focused on one large text and applied progressively more advanced analysis methods without relying on external datasets or libraries. I kept the workflow simple enough to test each function separately: first text cleaning, then word counts, and finally sentiment and Markov generation. My testing plan involved printing intermediate outputs to verify each step worked correctly before moving to the next stage.

AI tools like ChatGPT played a major role in helping me clarify difficult logic and learn new methods. I asked questions such as how to remove control characters with unicodedata.category(), how to handle punctuation, and how to build my own sentiment lexicon. ChatGPT also helped me rewrite and simplify my code, making it more readable while explaining the reasoning behind each step, and its explanations helped me understand what each piece of code does.

Going forward, I will use what I learned to build more advanced text-analysis projects, perhaps incorporating visualization libraries or real-world datasets. I now feel confident that I can design full analysis pipelines from raw text to summary statistics, and I better understand how AI assistance can guide learning without replacing critical thinking or testing.
176 changes: 176 additions & 0 deletions code/part3AI_code.py
# Part 3: exploration with AI. In this part I used ChatGPT to help me learn a new approach to sentiment text analysis.


import unicodedata
import re


def read_data(filename):
    """Read all lines from a text file."""
    with open(filename, "r", encoding="utf-8") as file:  # "r" opens the file read-only
        data = file.readlines()
    print("Number of lines:", len(data))
    print("Type:", type(data))
    print("\nFirst 5 lines preview:\n", data[:5])
    return data

def clean_text(lines):
    """Cleaning part: clean the text by removing punctuation, symbols, and
    special characters. It keeps only letters and spaces, makes all letters
    lowercase, and removes any empty lines. This helps prepare the text for
    analysis."""
    clean = []
    for line in lines:
        new_line = ""
        for character in line:
            if unicodedata.category(character).startswith('C'):
                # Unicode category 'C*' means a control character: skip it
                continue
            if character.isalpha() or character.isspace():
                # letters and spaces are kept, lowercased
                new_line += character.lower()
            else:
                # punctuation and symbols are replaced with a space
                new_line += ' '
        new_line = new_line.strip()
        if new_line != "":  # with help from ChatGPT: only append non-empty lines
            clean.append(new_line)

    text = ""
    for line in clean:
        text += line + " "
    return text


# Sentiment helpers (Think Python style), with help from ChatGPT
SENT_LEX = {
"love": 3, "loved": 3, "lovely": 3, "like": 2, "happy": 3, "joy": 3, "delight": 3,
"good": 2, "great": 3, "excellent": 4, "fortunate": 2, "kind": 2, "brave": 2,
"bad": -2, "worse": -3, "worst": -4, "hate": -3, "hated": -3, "angry": -3,
"sad": -2, "cry": -2, "cried": -2, "fear": -2, "terrible": -3, "horrible": -3,
"dead": -2, "death": -2, "poor": -2, "ugly": -2, "evil": -3, "wicked": -3
} #define sentiment metrics

NEGATIONS = {"not","no","never","none","hardly","barely","scarcely"}

def split_sentences(text):
    """Very simple rule-based splitter using punctuation boundaries."""
    # Think Python approach: simple regex and list filtering.
    # Split the text at any run of sentence-ending punctuation plus whitespace.
    sents = re.split(r'[.!?]+\s*', text.strip())
    result = [s.strip() for s in sents if s.strip() != ""]
    return result

def tokenize_words(text):
    """Lowercase alphabetic tokens only; returns a list of words."""
    return re.findall(r"[a-z]+", text.lower())

def sentiment_score(tokens, lexicon, negations, window=3):
    """
    Sum lexicon scores; if a negation appears, flip the sign of any
    scored tokens within the next `window` tokens (simple window-based
    negation handling).
    """
    score = 0
    flip = 0
    for t in tokens:
        if t in negations:
            flip = window
            continue
        s = lexicon.get(t, 0)
        if flip > 0 and s != 0:
            s = -s
        score += s
        if flip > 0:
            flip -= 1
    return score

def rolling_mean(values, k):
    """
    Smooth the sequence with a simple moving average of window k.
    No numpy: just accumulate in a list (Think Python style).
    """
    out = []
    window_sum = 0.0
    q = []
    for v in values:
        q.append(v)
        window_sum += v
        if len(q) > k:
            window_sum -= q.pop(0)
        out.append(window_sum / len(q))
    return out


def main():
    filename = "jane_eyre.txt"
    lines = read_data(filename)

    # keep a raw version (with punctuation) for sentence splitting
    raw_text = "".join(lines)

    cleaned = clean_text(lines)

    print("\n=== Clean Text Check ===")
    print("Length of cleaned text:", len(cleaned))
    print("First 3000 characters:\n", cleaned[:3000])
    # with ChatGPT's help: used to check that the cleaned text looks correct

    # --- Sentiment Analysis (Think Python style) ---
    lexicon = dict(SENT_LEX)  # copy so you can extend safely

    # split sentences from the raw text (still has . ! ?)
    sentences = split_sentences(raw_text)

    # score each sentence
    sent_scores = []
    for s in sentences:
        toks = tokenize_words(s)
        sent_scores.append(sentiment_score(toks, lexicon, NEGATIONS, window=3))

    # overall/average
    total_sent = sum(sent_scores)
    avg_sent = total_sent / max(1, len(sentences))
    print(f"\n[Sentiment] Total score = {total_sent:.2f}, Average per sentence = {avg_sent:.3f}")

    # top positive/negative sentences (like Think Python's "most frequent words" pattern):
    # build a list of (score, sentence), then sort
    scored_pairs = []
    i = 0
    while i < len(sentences):
        scored_pairs.append((sent_scores[i], sentences[i]))
        i += 1
    scored_pairs.sort(key=lambda p: p[0], reverse=True)
    print("\n[Sentiment] Top 5 positive sentences:")
    j = 0
    while j < 5 and j < len(scored_pairs):
        sc, s = scored_pairs[j]
        print(f"  (+{sc:.1f}) {s[:180]}")
        j += 1

    scored_pairs.sort(key=lambda p: p[0])  # ascending for negative
    print("\n[Sentiment] Top 5 negative sentences:")
    j = 0
    while j < 5 and j < len(scored_pairs):
        sc, s = scored_pairs[j]
        print(f"  ({sc:.1f}) {s[:180]}")
        j += 1

    # rolling trend (window=50 sentences)
    trend = rolling_mean(sent_scores, k=50)
    print("\n[Sentiment] First 20 values of rolling mean (k=50):")
    preview = []
    t = 0
    while t < 20 and t < len(trend):
        preview.append(round(trend[t], 3))
        t += 1
    print(preview)


if __name__ == "__main__":
    main()


# --- AI Assistance Acknowledgment ---
# Parts of this code (sentiment analysis and tokenization explanations)
# were developed with the help of ChatGPT (OpenAI, 2025).
# ChatGPT conversation link:
# https://chatgpt.com/share/690e2452-6088-8013-81c5-a30cefefaab2
