68 changes: 68 additions & 0 deletions README.md
# Text-Analysis-Project

Please read the [instructions](instructions.md).

## Project Overview
I decided to use histograms and Project Gutenberg to analyze and break down two books: East of Eden and The Great Gatsby. I also decided to look into movie reviews for the same books. I was hoping to learn more about APIs and data mining, and I was curious to look into two of my favorite books.

## Implementation
The first part of my code downloads and loads the books as text and creates a list of stop words that I use for data mining. I then created histograms to see which words appear the most (excluding stop words). This all fits together, as I can compare the books and further analyze them, such as seeing which words are used in each book. I also pickled my data by adding the texts to one list, which allowed me to have the two texts in one program.
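As a sketch, the histogram step described above can be written like this (the function name and sample text are illustrative, not the exact code in analyze.py):

```python
import string

def build_histogram(text, stopwords=frozenset()):
    """Count word frequencies, lowercasing and stripping punctuation."""
    hist = {}
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word and word not in stopwords:
            hist[word] = hist.get(word, 0) + 1
    return hist

hist = build_histogram("The old man and the sea. The sea!", stopwords={"the", "and"})
# Sort (count, word) pairs so the most frequent words come first
top = sorted(((c, w) for w, c in hist.items()), reverse=True)
```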

The next part of my code was the movie reviews, but I could not get it to work. I used AI to fix my code and ask for input; however, AI believes it is a library error. I also experimented with the OpenAI API but faced challenges with API authentication. I asked AI where the issue was, and it was my key. It told me to put my key in a .env file listed in .gitignore and then import the libraries. After importing these, my key was working.
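A minimal sketch of that .env approach, using only the standard library (load_env and DEMO_API_KEY are illustrative names; the python-dotenv package's load_dotenv() is the usual, more robust way to do this):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demo only: write a throwaway .env; a real key stays in an untracked .env
Path(".env").write_text("DEMO_API_KEY=sk-demo-key\n")
load_env()
api_key = os.environ["DEMO_API_KEY"]
```

Because .env is listed in .gitignore, the key never gets committed to the repository.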

## Results
Some of my results are the top twenty words from each book:
The top 20 words in East of Eden are:

eva: 463
us: 203
“i: 193
over: 192
might: 188
did: 187
know: 169
like: 163
see: 163
van: 162
your: 154
must: 154
eyes: 147
never: 143
nicholas: 142
because: 142
work: 139
down: 139
say: 138

The top 20 words in The Great Gatsby are:
gatsby: 189
“i: 175
tom: 173
daisy: 143
over: 141
like: 118
down: 118
came: 108
back: 108
man: 105
any: 102
little: 102
know: 96
just: 94
house: 92
before: 91
now: 91
went: 91
after: 86

I also have an MDS scatterplot that shows the correlation of 5 chunks from the two books. I used fuzzy matching to show the similarities from 0 to 1, where 0 is no correlation and 1 is exactly the same. These are the graphs:
![alt text](image.png)
![alt text](image-4.png)
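As a sketch of that chunk-similarity idea, the standard library's difflib.SequenceMatcher returns a ratio on the 0-to-1 scale described above (fuzzy-matching libraries such as thefuzz report the same ratio scaled to 0-100); the chunk helper here is illustrative:

```python
from difflib import SequenceMatcher

def chunk(text, n=5):
    """Split text into n roughly equal pieces."""
    size = max(1, len(text) // n)
    return [text[i:i + size] for i in range(0, size * n, size)]

def similarity(a, b):
    """Ratio in [0, 1]: 0 means no overlap, 1 means identical."""
    return SequenceMatcher(None, a, b).ratio()

score = similarity("the great gatsby", "the great gatsby")  # identical -> 1.0
```

Computing similarity() for every pair of chunks gives the matrix that an MDS plot can then lay out in two dimensions.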
For the pickling, I was able to get the two books into one program. However, I originally did not have the list, so I asked AI to walk me through the list and then fix my errors. Here are some inputs:
![alt text](image-1.png)
![alt text](image-2.png)

I also used Claude AI to make sure my code was correct, since I did try to implement extra steps. However, this was more complicated. Here, I asked Claude to help me with the movie reviews before messaging the professor:
![alt text](image-3.png)

## Reflection
From my point of view, there were a lot of things that went well, such as extracting the books and getting them into a histogram; however, some things did not go the way I expected. My first issue was trying to remove the punctuation, spaces, and special characters. I asked AI for this but got something complicated, so I decided to look for a simpler solution. I also asked my brother to assist me by looking over my code and giving me suggestions. However, his feedback was outside the knowledge I learned in OIM 3640.
Another big challenge was understanding the structure of Cinemagoer. I originally did not want to look into movies, as I was already having problems in Part 1. However, I wanted to challenge myself and get a sense of accomplishment by going over the movies of the two books. One thing I learned was properly using the .get function with default values to prevent crashes when keys were not present. I also wanted to try NLP with the charts. This was very successful after I learned how to merge a list into a dictionary and then pickle it. I do believe my project was scoped to a small extent, as it was simple and effective; I did not feel too comfortable doing something complex. My testing plan was commenting out parts of my code and then running what I had.
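For example, since Cinemagoer's movie objects can be read like dictionaries, .get with a default avoids a KeyError when a field is missing (the dictionary below is illustrative data, not a real API response):

```python
# Illustrative stand-in for a Cinemagoer movie record; real responses
# may simply lack fields like 'rating' for obscure titles.
movie = {"title": "East of Eden", "year": 1955}

rating = movie.get("rating", "N/A")    # missing key -> default, no crash
title = movie.get("title", "Unknown")  # present key -> its value
```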
My biggest takeaway was learning how to process and analyze text data with NLP. I also learned how to tokenize and extract specific data, such as stop words, and how to organize text into dictionaries to reveal patterns such as similar words. The biggest thing, however, was learning about external resources and how to handle libraries such as Reddit's and Cinemagoer. Going forward, I will use more NLP techniques and try to get developer accounts for Reddit and Twitter. I will also try to compare the books to their movies rather than the two books to each other. I wish I had studied more, looked over the JSON files, and started earlier.
219 changes: 219 additions & 0 deletions analyze.py
import random
import string
import sys
from unicodedata import category



def process_file(filename, skip_header):
"""Makes a histogram that counts the words from a file.

filename: string
skip_header: boolean, whether to skip the Gutenberg header

returns: map from each word to the number of times it appears.
"""
hist = {}
fp = open(filename, encoding="utf-8")

if skip_header:
skip_gutenberg_header(fp)

# strippables = string.punctuation + string.whitespace
strippables = "".join(
chr(i) for i in range(sys.maxunicode) if category(chr(i)).startswith("P")
) # Unicode punctuation characters. Ref: https://stackoverflow.com/a/60983895

for line in fp:
if line.startswith("*** END OF THE PROJECT"):
break

line = line.replace("-", " ")
line = line.replace(chr(8212), " ") # Em dash replacement

        for word in line.split():
            word = word.strip(strippables).lower()
            if word:  # skip tokens that were pure punctuation
                hist[word] = hist.get(word, 0) + 1

fp.close()
return hist


def skip_gutenberg_header(fp):
"""Reads from fp until it finds the line that ends the header.

fp: open file object
"""
start_marker = "START OF THE PROJECT"

for line in fp:
if start_marker.lower() in line.lower(): # Case-insensitive search
return
    # If the loop completes without finding the start marker
    raise ValueError(f"Header start marker '{start_marker}' not found in file.")


def total_words(hist):
"""Returns the total of the frequencies in a histogram."""
return sum(hist.values())


def different_words(hist):
"""Returns the number of different words in a histogram."""
return len(hist)


def most_common(hist, excluding_stopwords=False):
"""Makes a list of word-freq pairs in descending order of frequency.

hist: map from word to frequency

returns: list of (frequency, word) pairs
"""
stopwords = set()
if excluding_stopwords:
stopwords = {
"a",
"and",
"at", "as",
"be",
"but",
"by",
"for",
"had",
"he",
"her",
"his",
"i",
"in",
"is",
"it",
"so",
"that",
"the",
"them",
"to",
"with",
"which",
}
    # A small set of common English stop words I looked up
t = []
for word, freq in hist.items():
if excluding_stopwords and word in stopwords:
continue
t.append((freq, word))

t.sort(reverse=True)
return t


def print_most_common(hist, num=10):
"""Prints the most commons words in a histgram and their frequencies.

hist: histogram (map from word to frequency)
num: number of words to print
"""
common = most_common(hist)
for freq, word in common[:num]:
print(word, "\t", freq)


def subtract(d1, d2):
"""Returns a dictionary with all keys that appear in d1 but not d2.

d1, d2: dictionaries
"""
result = {}
for key in d1:
if key not in d2:
result[key] = d1[key]
return result


def random_word(hist):
"""Chooses a random word from a histogram.

The probability of each word is proportional to its frequency.
"""
words = []
for word, freq in hist.items():
words.extend([word] * freq)
return random.choice(words)


def main():
    # These text files were downloaded from gutenberg.org
hist = process_file("Parts/East of Eden.txt", skip_header=True)
words = process_file("Parts/words.txt", skip_header=False) # Ensure correct filename and path

print(hist)
print(f"Total number of words: {total_words(hist)}")
print(f"Number of different words: {different_words(hist)}")

t = most_common(hist, excluding_stopwords=True)
print("The most common words are:")
for freq, word in t[0:20]:
print(word, "\t", freq)

diff = subtract(hist, words)
print("The words in the book that aren't in the word list are:")
for word in diff.keys():
print(word, end=" ")

print("\n\nHere are some random words from the book")
for i in range(100):
print(random_word(hist), end=" ")


if __name__ == "__main__":
    main()


# Putting the books in a list to pickle; renamed from a second main() so it
# no longer shadows the function above (call pickle_books() to run it)
def pickle_books():
book_files = [
"Parts/East of Eden.txt",
"Parts/The Great Gatsby.txt"
]

histograms = []
for book in book_files:
hist = process_file(book, skip_header=True)
histograms.append(hist)
print(f"Processed '{book}':")
print(f" Total words: {total_words(hist)}")
print(f" Unique words: {different_words(hist)}\n")

# Example: most common words in each book
for i, hist in enumerate(histograms):
print(f"Most common words in Book {i+1}:")
top_words = most_common(hist, excluding_stopwords=True)
for freq, word in top_words[:10]:
print(f"{word}\t{freq}")
print("\n")

# Pickle
import pickle

    # Read the full texts from the files
with open('Parts/East of Eden.txt', 'r', encoding='utf-8') as f1:
east_of_eden_text = f1.read()

with open('Parts/The Great Gatsby.txt', 'r', encoding='utf-8') as f2:
great_gatsby_text = f2.read()

# Combine both into a dictionary
books = {
"East of Eden": east_of_eden_text,
"The Great Gatsby": great_gatsby_text
}

# Save data to a pickle file
with open('books_texts.pkl', 'wb') as f:
pickle.dump(books, f)

# Load data from the pickle file later
with open('books_texts.pkl', 'rb') as f:
reloaded_books = pickle.load(f)
Binary file added books_texts.pkl
Binary file added image-1.png
Binary file added image-2.png
Binary file added image-3.png
Binary file added image-4.png
Binary file added image.png