diff --git a/README.md b/README.md index 05aa109..4fcf64e 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,71 @@ # Text-Analysis-Project Please read the [instructions](instructions.md). + +Project Overview + I decided to use histograms and Project Gutenberg to analyze and break down two books: East of Eden and The Great Gatsby. I also decided to look into movie reviews for the same books. I was hoping to learn more about APIs and data mining, and I was also curious to dig into two of my favorite books. + +Implementation + The first part of my code downloads the books as text and creates a list of stop words which I use for data mining. I then created histograms to see which words appear the most (excluding stop words). This all fits together, as I can compare the books and analyze them further, for example by seeing which words are used in each book. I also pickled my data by adding the texts to one list, which allowed me to have the two texts in one program. + + The next part of my code was the movie reviews, but I could not get it to work. I used AI to fix my code and ask for input; however, the AI believes it is a library error. I also experimented with the OpenAI API but faced challenges with API authentication. I asked AI where the issue was, and since it was my key, it told me to put my key in a .env file listed in .gitignore and then import the libraries. After importing these, my key was working. 
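+The .env workflow described above is normally handled by the python-dotenv package's load_dotenv(). As a rough sketch of what that step does, here is a minimal standard-library-only version; the file name, key name, and key value below are made up for illustration:

```python
import os

def load_env_file(path=".env"):
    """Load KEY=VALUE pairs from a .env file into os.environ.

    Minimal stand-in for python-dotenv's load_dotenv(); values already
    present in the environment are left untouched.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip("'\""))

# Write a throwaway .env so the sketch is self-contained (the key is fake).
with open(".env", "w", encoding="utf-8") as f:
    f.write("# never commit this file; list it in .gitignore\n")
    f.write("DEMO_API_KEY=sk-example-not-a-real-key\n")

load_env_file()
print(os.environ["DEMO_API_KEY"])  # the API client would read the key from here
```

+Because .env is listed in .gitignore, the key stays on the local machine and never ends up in the repository.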
+ +Results + Some of my results are the top twenty words from each book. +The top 20 words in East of Eden are: + +eva: 463 +us: 203 +“i: 193 +over: 192 +might: 188 +did: 187 +know: 169 +like: 163 +see: 163 +van: 162 +your: 154 +must: 154 +eyes: 147 +never: 143 +nicholas: 142 +because: 142 +work: 139 +down: 139 +say: 138 + +The top 20 words in The Great Gatsby are: +gatsby: 189 +“i: 175 +tom: 173 +daisy: 143 +over: 141 +like: 118 +down: 118 +came: 108 +back: 108 +man: 105 +any: 102 +little: 102 +know: 96 +just: 94 +house: 92 +before: 91 +now: 91 +went: 91 +after: 86 + +I also have an MDS scatterplot that shows the correlation of 5 chunks from the two books. I used fuzzy matching to score the similarities from 0 to 1, where 0 is no correlation and 1 is exactly the same. This is the graph: +![alt text](image.png) +![alt text](image-4.png) +For the pickling, I was able to load the two books in one program. However, I originally did not have the list, so I asked AI to walk me through building it and then fix my errors. Here are some of the inputs: +![alt text](image-1.png) +![alt text](image-2.png) + +I also used Claude AI to make sure my code was correct, since I did try to implement extra steps. However, this was more complicated. Here, I asked Claude to help me with the movie reviews before messaging the professor: +![alt text](image-3.png) + +Reflection + From my point of view, there were a lot of things that went well, such as extracting the books and getting them into a histogram; however, some things did not go the way I expected. My first issue was trying to remove the punctuation, spaces, and special characters. I asked AI for this but got something complicated, so I decided to look for a simpler solution. I also asked my brother to assist me by looking over my code and giving me suggestions. However, his feedback was outside the knowledge I learned in OIM 3640. + Another big challenge was understanding the structure of the Cinemagoer library. 
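+The fuzzy similarity scores behind the MDS scatterplot in the Results (0 = no overlap, 1 = identical) can be illustrated with the standard library's difflib as a stand-in for a fuzzy-matching package; the short chunk strings here are made-up examples, not the actual 5 chunks per book:

```python
from difflib import SequenceMatcher

def chunk_similarity(a, b):
    """Similarity of two text chunks as a ratio from 0.0 (no overlap) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()

# Hypothetical chunks standing in for the chunks compared in the plot.
chunk_eden = "the salinas valley is in northern california"
chunk_gatsby = "in my younger and more vulnerable years"

print(chunk_similarity(chunk_eden, chunk_eden))    # identical text scores 1.0
print(chunk_similarity(chunk_eden, chunk_gatsby))  # partial overlap scores between 0 and 1
```

+A full pairwise matrix of these ratios is what a method like MDS can then project onto a 2D scatterplot.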
I originally did not want to look into movies, as I was already having problems in Part 1. However, I wanted to challenge myself and get a sense of accomplishment by going over the movies of the two books. One thing I learned was properly using the .get method with default values to prevent crashes when keys were not present. I also wanted to try NLP with the charts. This was very successful after I learned how to merge a list into a dictionary and then pickle it. I do believe my project was scoped to a small extent, as it was simple and effective; I did not feel too comfortable doing something complex. My testing plan was commenting out parts of my code and then running what I had. + My biggest takeaway was learning how to process and analyze text data with NLP. I also learned how to tokenize and extract specific data such as stop words, and how to organize text into dictionaries to reveal patterns such as similar words. The biggest thing, however, was learning about external resources and how to handle libraries such as Reddit and Cinemagoer. Going forward, I will use more NLP techniques and try to get developer accounts for Reddit and Twitter. I will also try to compare the books and movies to each other rather than the two books to each other. I wish I had studied more, looked over the JSON files, and started earlier \ No newline at end of file diff --git a/analyze.py b/analyze.py new file mode 100644 index 0000000..86a4ee6 --- /dev/null +++ b/analyze.py @@ -0,0 +1,219 @@ +import random +import string +import sys +from unicodedata import category + + + +def process_file(filename, skip_header): + """Makes a histogram that counts the words from a file. + + filename: string + skip_header: boolean, whether to skip the Gutenberg header + + returns: map from each word to the number of times it appears. 
+ """ + hist = {} + fp = open(filename, encoding="utf-8") + + if skip_header: + skip_gutenberg_header(fp) + + # strippables = string.punctuation + string.whitespace + strippables = "".join( + chr(i) for i in range(sys.maxunicode) if category(chr(i)).startswith("P") + ) # Unicode punctuation characters. Ref: https://stackoverflow.com/a/60983895 + + for line in fp: + if line.startswith("*** END OF THE PROJECT"): + break + + line = line.replace("-", " ") + line = line.replace(chr(8212), " ") # Em dash replacement + + for word in line.split(): + word = word.strip(strippables) + word = word.lower() + + hist[word] = hist.get(word, 0) + 1 + + fp.close() + return hist + + +def skip_gutenberg_header(fp): + """Reads from fp until it finds the line that ends the header. + + fp: open file object + """ + start_marker = "START OF THE PROJECT" + + for line in fp: + if start_marker.lower() in line.lower(): # Case-insensitive search + return + # If the loop completes without finding the start marker + raise ValueError(f"Header end marker '{start_marker}' not found in file.") + + +def total_words(hist): + """Returns the total of the frequencies in a histogram.""" + return sum(hist.values()) + + +def different_words(hist): + """Returns the number of different words in a histogram.""" + return len(hist) + + +def most_common(hist, excluding_stopwords=False): + """Makes a list of word-freq pairs in descending order of frequency. 
+ + hist: map from word to frequency + + returns: list of (frequency, word) pairs + """ + stopwords = set() + if excluding_stopwords: + stopwords = { + "a", + "and", + "at", "as", + "be", + "but", + "by", + "for", + "had", + "he", + "her", + "his", + "i", + "in", + "is", + "it", + "so", + "that", + "the", + "them", + "to", + "with", + "which", + } + # stop words compiled from a quick web search + t = [] + for word, freq in hist.items(): + if excluding_stopwords and word in stopwords: + continue + t.append((freq, word)) + + t.sort(reverse=True) + return t + + +def print_most_common(hist, num=10): + """Prints the most common words in a histogram and their frequencies. 
+ """ + words = [] + for word, freq in hist.items(): + words.extend([word] * freq) + return random.choice(words) + + +def main(): + # This text file is downloaded from gutenberg.org (https://www.gutenberg.org/cache/epub/1342/pg1342.txt) + hist = process_file("Parts/East of Eden.txt", skip_header=True) + words = process_file("Parts/words.txt", skip_header=False) # Ensure correct filename and path + + print(hist) + print(f"Total number of words: {total_words(hist)}") + print(f"Number of different words: {different_words(hist)}") + + t = most_common(hist, excluding_stopwords=True) + print("The most common words are:") + for freq, word in t[0:20]: + print(word, "\t", freq) + + diff = subtract(hist, words) + print("The words in the book that aren't in the word list are:") + for word in diff.keys(): + print(word, end=" ") + + print("\n\nHere are some random words from the book") + for i in range(100): + print(random_word(hist), end=" ") + + +if __name__ == "__main__": + main() + + +# Putting them on a list to pickle +def main(): + book_files = [ + "Parts/East of Eden.txt", + "Parts/The Great Gatsby.txt" + ] + + histograms = [] + for book in book_files: + hist = process_file(book, skip_header=True) + histograms.append(hist) + print(f"Processed '{book}':") + print(f" Total words: {total_words(hist)}") + print(f" Unique words: {different_words(hist)}\n") + + # Example: most common words in each book + for i, hist in enumerate(histograms): + print(f"Most common words in Book {i+1}:") + top_words = most_common(hist, excluding_stopwords=True) + for freq, word in top_words[:10]: + print(f"{word}\t{freq}") + print("\n") + +# Pickle +import pickle + +# Assuming you already read the texts from files +with open('Parts/East of Eden.txt', 'r', encoding='utf-8') as f1: + east_of_eden_text = f1.read() + +with open('Parts/The Great Gatsby.txt', 'r', encoding='utf-8') as f2: + great_gatsby_text = f2.read() + +# Combine both into a dictionary +books = { + "East of Eden": 
east_of_eden_text, + "The Great Gatsby": great_gatsby_text +} + +# Save data to a pickle file +with open('books_texts.pkl', 'wb') as f: + pickle.dump(books, f) + +# Load data from the pickle file later +with open('books_texts.pkl', 'rb') as f: + reloaded_books = pickle.load(f) diff --git a/books_texts.pkl b/books_texts.pkl new file mode 100644 index 0000000..f4c6a43 Binary files /dev/null and b/books_texts.pkl differ diff --git a/image-1.png b/image-1.png new file mode 100644 index 0000000..1177b6e Binary files /dev/null and b/image-1.png differ diff --git a/image-2.png b/image-2.png new file mode 100644 index 0000000..b5fd598 Binary files /dev/null and b/image-2.png differ diff --git a/image-3.png b/image-3.png new file mode 100644 index 0000000..817a6ab Binary files /dev/null and b/image-3.png differ diff --git a/image-4.png b/image-4.png new file mode 100644 index 0000000..196781f Binary files /dev/null and b/image-4.png differ diff --git a/image.png b/image.png new file mode 100644 index 0000000..3fd2a14 Binary files /dev/null and b/image.png differ diff --git a/part2.py b/part2.py new file mode 100644 index 0000000..86a4ee6 --- /dev/null +++ b/part2.py @@ -0,0 +1,219 @@ +import random +import string +import sys +from unicodedata import category + + + +def process_file(filename, skip_header): + """Makes a histogram that counts the words from a file. + + filename: string + skip_header: boolean, whether to skip the Gutenberg header + + returns: map from each word to the number of times it appears. + """ + hist = {} + fp = open(filename, encoding="utf-8") + + if skip_header: + skip_gutenberg_header(fp) + + # strippables = string.punctuation + string.whitespace + strippables = "".join( + chr(i) for i in range(sys.maxunicode) if category(chr(i)).startswith("P") + ) # Unicode punctuation characters. 
Ref: https://stackoverflow.com/a/60983895 + + for line in fp: + if line.startswith("*** END OF THE PROJECT"): + break + + line = line.replace("-", " ") + line = line.replace(chr(8212), " ") # Em dash replacement + + for word in line.split(): + word = word.strip(strippables) + word = word.lower() + + hist[word] = hist.get(word, 0) + 1 + + fp.close() + return hist + + +def skip_gutenberg_header(fp): + """Reads from fp until it finds the line that ends the header. + + fp: open file object + """ + start_marker = "START OF THE PROJECT" + + for line in fp: + if start_marker.lower() in line.lower(): # Case-insensitive search + return + # If the loop completes without finding the start marker + raise ValueError(f"Header end marker '{start_marker}' not found in file.") + + +def total_words(hist): + """Returns the total of the frequencies in a histogram.""" + return sum(hist.values()) + + +def different_words(hist): + """Returns the number of different words in a histogram.""" + return len(hist) + + +def most_common(hist, excluding_stopwords=False): + """Makes a list of word-freq pairs in descending order of frequency. + + hist: map from word to frequency + + returns: list of (frequency, word) pairs + """ + stopwords = set() + if excluding_stopwords: + stopwords = { + "a", + "and", + "at", "as", + "be", + "but", + "by", + "for", + "had", + "he", + "her", + "his", + "i", + "in", + "is", + "it", + "so", + "that", + "the", + "them", + "to", + "with", + "which", + } + # stop words compiled from a quick web search + t = [] + for word, freq in hist.items(): + if excluding_stopwords and word in stopwords: + continue + t.append((freq, word)) + + t.sort(reverse=True) + return t + + +def print_most_common(hist, num=10): + """Prints the most common words in a histogram and their frequencies. 
+ + hist: histogram (map from word to frequency) + num: number of words to print + """ + common = most_common(hist) + for freq, word in common[:num]: + print(word, "\t", freq) + + +def subtract(d1, d2): + """Returns a dictionary with all keys that appear in d1 but not d2. + + d1, d2: dictionaries + """ + result = {} + for key in d1: + if key not in d2: + result[key] = d1[key] + return result + + +def random_word(hist): + """Chooses a random word from a histogram. + + The probability of each word is proportional to its frequency. + """ + words = [] + for word, freq in hist.items(): + words.extend([word] * freq) + return random.choice(words) + + +def main(): + # This text file is downloaded from gutenberg.org (https://www.gutenberg.org/cache/epub/1342/pg1342.txt) + hist = process_file("Parts/East of Eden.txt", skip_header=True) + words = process_file("Parts/words.txt", skip_header=False) # Ensure correct filename and path + + print(hist) + print(f"Total number of words: {total_words(hist)}") + print(f"Number of different words: {different_words(hist)}") + + t = most_common(hist, excluding_stopwords=True) + print("The most common words are:") + for freq, word in t[0:20]: + print(word, "\t", freq) + + diff = subtract(hist, words) + print("The words in the book that aren't in the word list are:") + for word in diff.keys(): + print(word, end=" ") + + print("\n\nHere are some random words from the book") + for i in range(100): + print(random_word(hist), end=" ") + + +if __name__ == "__main__": + main() + + +# Putting them on a list to pickle (renamed so it does not redefine main above) +def process_books(): + book_files = [ + "Parts/East of Eden.txt", + "Parts/The Great Gatsby.txt" + ] + + histograms = [] + for book in book_files: + hist = process_file(book, skip_header=True) + histograms.append(hist) + print(f"Processed '{book}':") + print(f" Total words: {total_words(hist)}") + print(f" Unique words: {different_words(hist)}\n") + + # Example: most common words in each book + for i, hist in enumerate(histograms): + 
print(f"Most common words in Book {i+1}:") + top_words = most_common(hist, excluding_stopwords=True) + for freq, word in top_words[:10]: + print(f"{word}\t{freq}") + print("\n") + +# Pickle +import pickle + +# Assuming you already read the texts from files +with open('Parts/East of Eden.txt', 'r', encoding='utf-8') as f1: + east_of_eden_text = f1.read() + +with open('Parts/The Great Gatsby.txt', 'r', encoding='utf-8') as f2: + great_gatsby_text = f2.read() + +# Combine both into a dictionary +books = { + "East of Eden": east_of_eden_text, + "The Great Gatsby": great_gatsby_text +} + +# Save data to a pickle file +with open('books_texts.pkl', 'wb') as f: + pickle.dump(books, f) + +# Load data from the pickle file later +with open('books_texts.pkl', 'rb') as f: + reloaded_books = pickle.load(f)