README.md
# Text-Analysis-Project

Please read the [instructions](instructions.md).

1. Project Overview (1 paragraph)

As an avid reader and literature lover, I decided to choose Project Gutenberg as my source. While browsing through the titles, I settled on the following books:
- A Tale of Two Cities by Charles Dickens
- The Great Gatsby by F. Scott Fitzgerald
- The Hound of the Baskervilles by Arthur Conan Doyle
- Little Women by Louisa May Alcott
- Peter Pan by J.M. Barrie

My goal was to create a short text by combining words from each piece of literature, which I accomplished in the "combining-texts" file.
In addition to implementing Markov Text Synthesis, I analysed the frequency of words for each text in the "word-frequences" file. Lastly, I summarised the statistics of the top ten words that appeared in each book.

2. Implementation (1-2 paragraphs)

The implementation aspect I struggled with the most was the Markov Text section, because I had very little idea of how to implement it. However, since I knew what my end goal was (to create short texts from multiple stories), I created a basic coding outline and asked GenAI for the more intricate details. In the "images" folder I have attached photos of my interactions with tools like ChatGPT, Claude, and Microsoft Copilot.
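The idea behind the Markov step can be sketched as follows. This is a simplified version of the logic in `combining-texts.py`, run on a toy token list rather than the Gutenberg texts: each n-word state maps to a counter of the words that follow it, and generation walks the model by weighted random choice.

```python
import random
from collections import defaultdict, Counter

def build_model(tokens, n=1):
    # Map each n-token state to a Counter of the tokens that follow it
    model = defaultdict(Counter)
    for i in range(len(tokens) - n):
        state = tuple(tokens[i:i + n])
        model[state][tokens[i + n]] += 1
    return model

def generate(model, length=10, n=1, seed=None):
    rng = random.Random(seed)
    state = rng.choice(list(model.keys()))
    output = list(state)
    while len(output) < length:
        nxt = model.get(state)
        if not nxt:  # dead end: restart from a random state
            state = rng.choice(list(model.keys()))
            continue
        # Weighted choice: frequent successors are picked more often
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        output.append(word)
        state = tuple(output[-n:])
    return " ".join(output)

toy_tokens = "the cat sat on the mat and the cat ran".split()
print(generate(build_model(toy_tokens, n=1), length=8, n=1, seed=42))
```

With a larger n (the project uses n=5), each state is a longer phrase, so the output tracks the source texts much more closely.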

There were several times when I had to reevaluate my code and thought process because I would think, "I'm making it more complicated than it needs to be." One instance was when I attempted to introduce punctuation in the "combining-texts" file, which made the output texts extremely confusing and unappealing. Throughout the project, I also consulted resources from our class (e.g. recordings), GitHub, and platforms like Stack Overflow.

3. Results (1-3 paragraphs + figures and examples)

I loved the results for Homework 2 because they were both creative and analytical. For example, I loved reading the different twenty-five-word texts generated by the Markov Text Synthesis. What made them unique is that they were composed of the characters, plots, and verbs of each book. Here are a few examples:

"your whole morning for nothing thought jo as she rang the bell half an hour later than usual and stood hot tired and dispirited surveying"

"cigarette was trembling suddenly she threw the cigarette and the burning match on the carpet “oh you want too much ” she cried to gatsby"

"of this and there s no reason you should all die of a surfeit because i ve been a fool cried amy wiping her eyes"

What is really interesting about these examples is that they read like normal stories. There are some grammatical errors, but the storyline is consistent. I also find it amusing when different eras interact. For example, A Tale of Two Cities is written in an older style of English, while The Great Gatsby is more modern, so the outputs are sometimes a mix of both.

Meanwhile, the Word Frequencies Analysis helped me realise how some words can have a great impact while being used scarcely, and vice versa. For instance, the word "dust" is used only six times in Peter Pan, but it is a crucial aspect of the plot because it is what allows the children to fly.
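Counting a single word like that is straightforward once the text is tokenised; a minimal sketch, with a made-up sentence standing in for the actual Peter Pan text the project downloads:

```python
# Hypothetical stand-in text; the project fetches the real book from Project Gutenberg.
text = "The fairy dust sparkled, and the dust let the children fly."
words = text.lower().replace(",", " ").replace(".", " ").split()
dust_count = words.count("dust")
print(dust_count)  # 2
```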

Regarding the Summary Statistics, the top ten words tended to be modal verbs (e.g. "could" and "would"), character names, and the verb "said."
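That pattern falls out of `Counter.most_common`. A small sketch of how the summary-statistics scripts surface those words, with a hand-picked stopword list standing in for NLTK's and an invented sample sentence:

```python
from collections import Counter

# Tiny illustrative stopword list; the actual scripts use NLTK's English stopwords.
STOPWORDS = {"the", "a", "and", "to", "of", "i", "he", "she", "it", "you", "not"}

def top_words(text, k=10):
    # Lowercase, strip surrounding punctuation, drop stopwords and 1-letter words
    words = [w.strip('.,!?;:"\'') for w in text.lower().split()]
    words = [w for w in words if w and w not in STOPWORDS and len(w) > 1]
    return Counter(words).most_common(k)

sample = '"I could not," said Jo. "You would not," said Amy. Jo said she could.'
print(top_words(sample, k=3))  # "said" ranks first
```

On a full novel, modal verbs and character names dominate exactly because stopword lists remove the even more frequent function words around them.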


4. Reflection (1-2 paragraphs)

Overall, I really enjoyed this project because it allowed me to see how Python can be used for data analysis. Throughout the process I ran into a few issues, especially when coding the Summary Statistics and Combining Texts (Markov Text) sections, because I would not get the output that I wanted. In the case of the Summary Statistics, I would get the full ranking with every word included. It wasn't until I submitted the code to Claude that I learned the reason: I hadn't deleted a leftover print of the full frequencies from the Word Frequencies section. Meanwhile, for the Markov Text, I didn't have a clear understanding of how to code it, but because I had a vision I wanted to implement, I was able to ask Microsoft Copilot for help. I provided the code I had developed, my request, and any special comments (e.g. "please keep the coding language simple so I can understand it"). Thanks to consulting AI, I have a deeper understanding of how text analysis works and how to interpret the code that makes it possible.

I also experienced issues with the NLTK library, because every time I attempted to download its data I kept getting a warning. Eventually, thanks to asking ChatGPT multiple times and rewatching class recordings, I got it to work. Although I struggled with certain concepts and invested a lot of time in debugging, my biggest takeaway from this project is that Python is really cool and versatile. Thanks to my project scope, my creative goal was never negatively affected.
Binary file added images/.DS_Store
Binary file not shown.
markov-text-synthesis/combining-texts.py
import urllib.request
import ssl
import re
import nltk
from nltk.corpus import stopwords
from collections import defaultdict, Counter
import random

# List of URLs of text files from Project Gutenberg
gutenberg_urls = [
    'https://www.gutenberg.org/cache/epub/98/pg98.txt',
    'https://www.gutenberg.org/cache/epub/64317/pg64317.txt',
    'https://www.gutenberg.org/cache/epub/2852/pg2852.txt',
    'https://www.gutenberg.org/cache/epub/37106/pg37106.txt',
    'https://www.gutenberg.org/cache/epub/16/pg16.txt'
]

# Headers to seem like a browser request
headers = {
    "User-Agent": "Mozilla/5.0"
}

# Ensure the stopwords corpus is downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Create an unverified SSL context
context = ssl._create_unverified_context()

def strip_headers(text):
    """Remove the Project Gutenberg header and footer from the text."""

    # Find the start of the main content
    start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE)
    if start:
        text = text[start.end():]

    # Find the end of the main content
    end = re.search(r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*|\[Illustration\]", text, re.IGNORECASE)
    if end:
        text = text[:end.start()]

    return text.strip()

def find_words(text):
    """Convert text to lowercase, remove punctuation, and split into words."""
    # Remove Roman numerals (e.g. chapter headings); this must run before
    # lowercasing, since the character class only matches uppercase letters
    text = re.sub(r'\b(?:[IVXLCDM]+)\b', '', text)

    # Convert text to lowercase
    text = text.lower()

    # Replace punctuation with spaces
    for char in '.,!?;:()[]{}""\'-_':
        text = text.replace(char, ' ')

    return text.split()

def create_markov_model(tokens, n=1):
    """Create a Markov model. AI helped a lot with this section because I did not know what to do"""

    model = defaultdict(Counter)
    for i in range(len(tokens) - n):
        state = tuple(tokens[i:i + n])
        next_state = tokens[i + n]
        model[state][next_state] += 1
    return model

def generate_text(model, length=100, n=1):
    """Generate text using the Markov model."""

    # Random starting state
    state = random.choice(list(model.keys()))
    output = list(state)

    for _ in range(length - n):
        # If the state is not in the model or has no next states, choose a new random state
        if state not in model or not model[state]:
            state = random.choice(list(model.keys()))

        # Choose the next state based on the probabilities in the model
        next_state = random.choices(list(model[state].keys()), list(model[state].values()))[0]
        output.append(next_state)
        state = tuple(output[-n:])

    return ' '.join(output)

def main():
    all_tokens = []

    # Process each URL in the list, clean it, and print any errors if necessary
    for url in gutenberg_urls:
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req, context=context) as f:
                raw_text = f.read().decode('utf-8')
                clean_text = strip_headers(raw_text)
                tokens = find_words(clean_text)
                all_tokens.extend(tokens)
        except Exception as e:
            print(f"An error occurred with {url}: {e}")

    # Create a Markov model and generate text
    markov_model = create_markov_model(all_tokens, n=5)
    generated_text = generate_text(markov_model, length=25, n=5)
    print(generated_text)

if __name__ == "__main__":
    main()
summary-statistics/a-tale-of-two-cities-ss.py
import urllib.request
import ssl
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter

# URL of the text file from Project Gutenberg
url = 'https://www.gutenberg.org/cache/epub/98/pg98.txt'

# Headers to seem like a browser request
headers = {
    "User-Agent": "Mozilla/5.0"
}

# Ensure stopword corpus is downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Create request object with specified URL and headers
req = urllib.request.Request(url, headers=headers)

# Create an unverified SSL context
context = ssl._create_unverified_context()

def strip_headers(text):
    """Remove the Project Gutenberg header and footer from the text."""

    # Find the start of the main content
    start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE)
    if start:
        text = text[start.end():]

    # Find the end of the main content
    end = re.search(r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE)
    if end:
        text = text[:end.start()]

    return text.strip()

def word_frequencies(text):
    """Calculate the frequency of each word in the text, excluding stopwords."""
    # Convert text to lowercase
    text = text.lower()

    # Replace punctuation with spaces
    for char in '.,!?;:()[]{}"\'-_':
        text = text.replace(char, ' ')

    # Split text into words
    words = text.split()

    # Remove stopwords and single-character words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words and len(word) > 1]

    return Counter(words)

def main():
    try:
        # Open the URL and read the raw text
        with urllib.request.urlopen(req, context=context) as f:
            raw_text = f.read().decode('utf-8')

        # Remove headers and footers from the text
        clean_text = strip_headers(raw_text)

        # Calculate word frequencies
        frequencies = word_frequencies(clean_text)

        # Get the top 10 most frequent words
        top_10 = frequencies.most_common(10)
        print(f"Length of top_10: {len(top_10)}")
        print("Top 10 most frequent words:")
        for word, count in top_10:
            print(f"{word}: {count}")

    except Exception as e:
        # Print any errors that occur
        print("An error occurred:", e)

if __name__ == "__main__":
    main()
summary-statistics/great-gatsby-ss.py
import urllib.request
import ssl
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter

# URL of the text file from Project Gutenberg
url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt'

# Headers to seem like a browser request
headers = {
    "User-Agent": "Mozilla/5.0"
}

# Ensure stopword corpus is downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

# Create request object with specified URL and headers
req = urllib.request.Request(url, headers=headers)

# Create an unverified SSL context
context = ssl._create_unverified_context()

def strip_headers(text):
    """Remove the Project Gutenberg header and footer from the text."""

    # Find the start of the main content
    start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE)
    if start:
        text = text[start.end():]

    # Find the end of the main content
    end = re.search(r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE)
    if end:
        text = text[:end.start()]

    return text.strip()

def word_frequencies(text):
    """Calculate the frequency of each word in the text, excluding stopwords."""
    # Convert text to lowercase
    text = text.lower()

    # Replace punctuation with spaces
    for char in '.,!?;:()[]{}"\'-_':
        text = text.replace(char, ' ')

    # Split text into words
    words = text.split()

    # Remove stopwords and single-character words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words and len(word) > 1]

    return Counter(words)

def main():
    try:
        # Open the URL and read the raw text
        with urllib.request.urlopen(req, context=context) as f:
            raw_text = f.read().decode('utf-8')

        # Remove headers and footers from the text
        clean_text = strip_headers(raw_text)

        # Calculate word frequencies
        frequencies = word_frequencies(clean_text)

        # Get the top 10 most frequent words
        top_10 = frequencies.most_common(10)
        print(f"Length of top_10: {len(top_10)}")
        print("Top 10 most frequent words:")
        for word, count in top_10:
            print(f"{word}: {count}")

    except Exception as e:
        # Print any errors that occur
        print("An error occurred:", e)

if __name__ == "__main__":
    main()