diff --git a/README.md b/README.md index 05aa109..3d13fec 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,44 @@ # Text-Analysis-Project Please read the [instructions](instructions.md). + +1. Project Overview (1 paragraph) + +As an avid reader and literature lover, I chose Project Gutenberg as my source. While browsing through the titles, I settled on the following books: + - A Tale of Two Cities by Charles Dickens + - The Great Gatsby by F. Scott Fitzgerald + - The Hound of the Baskervilles by Arthur Conan Doyle + - Little Women by Louisa May Alcott + - Peter Pan by J.M. Barrie + +My goal was to create a short text by combining words from each piece of literature, which I accomplished in the "combining-texts" file. In addition to implementing Markov Text Synthesis, I analysed the frequency of words for each text in the "word-frequencies" file. Lastly, I summarised the statistics of the top ten words that appeared in each book. + +2. Implementation (1-2 paragraphs) + +The implementation aspect I struggled with most was the Markov Text Synthesis section, because I had very little idea of how to implement it. However, since I knew what my end goal was (to create short texts from multiple stories), I created a basic coding outline and asked GenAI for the more intricate details. In the "images" folder I have attached screenshots of my interactions with tools like ChatGPT, Claude, and Microsoft Copilot. + +There were several times when I had to re-evaluate my code and thought processes because I would think "I'm making it more complicated than it needs to be." An instance of this is when I attempted to introduce punctuation in the "combining-texts" file, which made the output texts extremely confusing and unappealing. Throughout the project, I also consulted resources from our class (ex: recordings), GitHub, and platforms like Stack Overflow. + +3.
Results (1-3 paragraphs + figures and examples) + +I loved the results for this project because they were both creative and analytical. For example, I enjoyed reading the different twenty-five-word texts generated by Markov Text Synthesis. What made them unique is that they were composed of the characters, plots, and verbs of each book. Here are a few examples: + +"your whole morning for nothing thought jo as she rang the bell half an hour later than usual and stood hot tired and dispirited surveying" + +"cigarette was trembling suddenly she threw the cigarette and the burning match on the carpet “oh you want too much ” she cried to gatsby" + +"of this and there s no reason you should all die of a surfeit because i ve been a fool cried amy wiping her eyes" + +What is really interesting about these examples is that they read like normal stories. There are some grammatical errors, but the storyline stays consistent. I also find it amusing when different eras interact. For example, A Tale of Two Cities is written in an older style of English, while The Great Gatsby is more modern, so the outputs are sometimes a mix of both. + +Meanwhile, the Word Frequencies analysis helped me realise how some words can have a great impact while being used sparingly, and vice versa. For instance, the word "dust" is used only six times in Peter Pan, but it is a crucial aspect of the plot because it is what allows the children to fly. + +Regarding the Summary Statistics, the Top 10 words tended to be modal verbs (ex: could and would), character names, and the verb "said." + + +4. Reflection (1-2 paragraphs) + +Overall, I really enjoyed this project because it allowed me to see how Python can be used for data analysis. Throughout the process I ran into a few issues, especially when coding the Summary Statistics and Combining Texts (Markov Text) sections, because I would not get the output that I wanted. 
In the case of the Summary Statistics, I would get a ranking of every word in the text instead of just the top ten. It wasn't until I submitted the code to Claude that I found the cause: I hadn't removed the print(frequencies) call left over from the Word Frequencies section. Meanwhile, for the Markov Text section, I didn't have a clear understanding of how to code it, but because I had a vision I wanted to implement, I was able to ask Microsoft Copilot for help. I provided the code I had developed, my request, and any special comments (ex: "please keep the coding language simple so I can understand it"). Thanks to consulting AI, I have a deeper understanding of how text analysis works and how to interpret the code that makes it possible. + +I also experienced issues with the NLTK library: every time I attempted to download the stopwords corpus, I kept getting a warning. Eventually, thanks to asking ChatGPT multiple times and rewatching class recordings, I got it to work. Although I struggled with certain concepts and invested a lot of time in debugging, my biggest takeaway from this project is that Python is really cool and versatile. Because I kept my project scope manageable, my creative goal was never compromised. 
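+ +The Markov synthesis described above can be sketched in miniature. This is an illustrative first-order example on toy tokens, not the project code itself; the helper names build_model and generate are hypothetical:

```python
import random
from collections import Counter, defaultdict

def build_model(tokens, n=1):
    """Map each n-word state to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n):
        model[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return model

def generate(model, length=10, n=1, seed=0):
    """Walk the chain, sampling each next word by its observed frequency."""
    rng = random.Random(seed)
    state = rng.choice(list(model.keys()))
    output = list(state)
    while len(output) < length:
        # If the state has no recorded successors, restart from a random state
        if state not in model or not model[state]:
            state = rng.choice(list(model.keys()))
        words, weights = zip(*model[state].items())
        output.append(rng.choices(words, weights)[0])
        state = tuple(output[-n:])
    return ' '.join(output)

tokens = "the dog saw the cat and the cat saw the dog".split()
model = build_model(tokens, n=1)
print(generate(model, length=8, n=1))
```

With a real corpus and a larger state size (the project uses n=5), the generated snippets read much more like the source books.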
\ No newline at end of file diff --git a/images/.DS_Store b/images/.DS_Store new file mode 100644 index 0000000..0e01d7d Binary files /dev/null and b/images/.DS_Store differ diff --git a/images/screenshots/chatgpt-question-screenshot copy.png b/images/screenshots/chatgpt-question-screenshot copy.png new file mode 100644 index 0000000..589f7f4 Binary files /dev/null and b/images/screenshots/chatgpt-question-screenshot copy.png differ diff --git a/images/screenshots/claude-question-screenshot.png b/images/screenshots/claude-question-screenshot.png new file mode 100644 index 0000000..693b103 Binary files /dev/null and b/images/screenshots/claude-question-screenshot.png differ diff --git a/images/screenshots/copilot-question-screenshot.png b/images/screenshots/copilot-question-screenshot.png new file mode 100644 index 0000000..4a60200 Binary files /dev/null and b/images/screenshots/copilot-question-screenshot.png differ diff --git a/markov-text-synthesis/combining-texts.py b/markov-text-synthesis/combining-texts.py new file mode 100644 index 0000000..dde04cc --- /dev/null +++ b/markov-text-synthesis/combining-texts.py @@ -0,0 +1,111 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords +from collections import defaultdict, Counter +import random + +# List of URLs of text files from Project Gutenberg +gutenberg_urls = [ + 'https://www.gutenberg.org/cache/epub/98/pg98.txt', + 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt', + 'https://www.gutenberg.org/cache/epub/2852/pg2852.txt', + 'https://www.gutenberg.org/cache/epub/37106/pg37106.txt', + 'https://www.gutenberg.org/cache/epub/16/pg16.txt' +] + +# Headers to seem like a browser request +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure the stopwords corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create an unverified SSL context +context = ssl._create_unverified_context() + +def strip_headers(text): 
+ """Remove the Project Gutenberg header and footer from the text.""" + + # Find the start of the main content + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end of the main content + end = re.search(r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*|\[Illustration\]", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def find_words(text): + """Convert text to lowercase, remove punctuation, and split into words.""" + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with spaces + for char in '.,!?;:()[]{}""\'-_': + text = text.replace(char, ' ') + + # Remove Roman numerals + text = re.sub(r'\b(?:[IVXLCDM]+)\b', '', text) + + return text.split() + +def create_markov_model(tokens, n=1): + """Create a Markov model. AI helped a lot with this section because I did not know what to do""" + + model = defaultdict(Counter) + for i in range(len(tokens) - n): + state = tuple(tokens[i:i+n]) + next_state = tokens [i+n] + model[state] [next_state] += 1 + return model + +def generate_text(model, length=100, n=1): + """Generate text using the Markov model.""" + + # Random starting state + state = random.choice(list(model.keys())) + output = list(state) + + for _ in range(length - n): + # If the state is not in the model or has no next states, choose a new random state + if state not in model or not model [state]: + state = random.choice(list(model.keys())) + + # Choose the next state based on the probabilities in the model + next_state = random.choices(list(model[state].keys()), list(model[state].values()))[0] + output.append(next_state) + state = tuple(output[-n:]) + + return ' '.join(output) + +def main(): + all_tokens = [] + + # Process each URL in the list, clean it, and print any errors if necessary + for url in gutenberg_urls: + req = urllib.request.Request(url, headers=headers) + try: + with 
urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + clean_text = strip_headers(raw_text) + tokens = find_words(clean_text) + all_tokens.extend(tokens) + except Exception as e: + print(f"An error occurred with {url}: {e}") + + # Create a Markov model and generate text + markov_model = create_markov_model(all_tokens, n=5) + generated_text = generate_text(markov_model, length=25, n=5) + print(generated_text) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/summary-statistics/a-tale-of-two-cities-ss.py b/summary-statistics/a-tale-of-two-cities-ss.py new file mode 100644 index 0000000..268ed6c --- /dev/null +++ b/summary-statistics/a-tale-of-two-cities-ss.py @@ -0,0 +1,85 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords +from collections import Counter + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/98/pg98.txt' + +# Headers to seem like a browser request +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure stopword corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create request object with specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove the Project Gutenberg header and footer from the text.""" + + # Find the start of the main content + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end of the main content + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + # Convert text to 
lowercase + text = text.lower() + + # Replace punctuation with spaces + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords and single-character words + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words and len(word) > 1] + + return Counter(words) + +def main(): + try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Get the top 10 most frequent words + top_10 = frequencies.most_common(10) + print(f"Length of top_10: {len(top_10)}") + print("Top 10 most frequent words:") + for word, count in top_10: + print(f"{word}: {count}") + + except Exception as e: + # Print any errors that occur + print("An error occurred:", e) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/summary-statistics/great-gatsby-ss.py b/summary-statistics/great-gatsby-ss.py new file mode 100644 index 0000000..d10c994 --- /dev/null +++ b/summary-statistics/great-gatsby-ss.py @@ -0,0 +1,85 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords +from collections import Counter + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt' + +# Headers to seem like a browser request +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure stopword corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create request object with specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove the 
Project Gutenberg header and footer from the text.""" + + # Find the start of the main content + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end of the main content + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with spaces + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords and single-character words + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words and len(word) > 1] + + return Counter(words) + +def main(): + try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Get the top 10 most frequent words + top_10 = frequencies.most_common(10) + print(f"Length of top_10: {len(top_10)}") + print("Top 10 most frequent words:") + for word, count in top_10: + print(f"{word}: {count}") + + except Exception as e: + # Print any errors that occur + print("An error occurred:", e) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/summary-statistics/hound-baskerville-ss.py b/summary-statistics/hound-baskerville-ss.py new file mode 100644 index 0000000..f671fed --- /dev/null +++ b/summary-statistics/hound-baskerville-ss.py @@ -0,0 +1,87 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords +from collections import Counter + 
+nltk.download('stopwords') + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/2852/pg2852.txt' + +# Headers to seem like a browser request +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure stopword corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create request object with specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove the Project Gutenberg header and footer from the text.""" + + # Find the start of the main content + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end of the main content + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with spaces + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords and single-character words + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words and len(word) > 1] + + return Counter(words) + +def main(): + try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Get the top 10 most frequent words + top_10 = frequencies.most_common(10) + print(f"Length of top_10: {len(top_10)}") + print("Top 10 
most frequent words:") + for word, count in top_10: + print(f"{word}: {count}") + + except Exception as e: + # Print any errors that occur + print("An error occurred:", e) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/summary-statistics/little-women-ss.py b/summary-statistics/little-women-ss.py new file mode 100644 index 0000000..f75c080 --- /dev/null +++ b/summary-statistics/little-women-ss.py @@ -0,0 +1,87 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords +from collections import Counter + +nltk.download('stopwords') + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/37106/pg37106.txt' + +# Headers to seem like a browser request +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure stopword corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create request object with specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove the Project Gutenberg header and footer from the text.""" + + # Find the start of the main content + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end of the main content + end = re.search (r"\[Illustration\]", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with spaces + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords and single-character words + stop_words = set(stopwords.words('english')) + words = [word 
for word in words if word not in stop_words and len(word) > 1] + + return Counter(words) + +def main(): + try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Get the top 10 most frequent words + top_10 = frequencies.most_common(10) + print(f"Length of top_10: {len(top_10)}") + print("Top 10 most frequent words:") + for word, count in top_10: + print(f"{word}: {count}") + + except Exception as e: + # Print any errors that occur + print("An error occurred:", e) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/summary-statistics/peter-pan-ss.py b/summary-statistics/peter-pan-ss.py new file mode 100644 index 0000000..3395ef6 --- /dev/null +++ b/summary-statistics/peter-pan-ss.py @@ -0,0 +1,87 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords +from collections import Counter + +nltk.download('stopwords') + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/16/pg16.txt' + +# Headers to seem like a browser request +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure stopword corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create request object with specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove the Project Gutenberg header and footer from the text.""" + + # Find the start of the main content + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end of the main content + end = re.search 
(r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with spaces + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords and single-character words + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words and len(word) > 1] + + return Counter(words) + +def main(): + try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Get the top 10 most frequent words + top_10 = frequencies.most_common(10) + print(f"Length of top_10: {len(top_10)}") + print("Top 10 most frequent words:") + for word, count in top_10: + print(f"{word}: {count}") + + except Exception as e: + # Print any errors that occur + print("An error occurred:", e) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/word-frequencies/a-tale-of-two-cities.py b/word-frequencies/a-tale-of-two-cities.py new file mode 100644 index 0000000..be3aa34 --- /dev/null +++ b/word-frequencies/a-tale-of-two-cities.py @@ -0,0 +1,91 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/98/pg98.txt' + + +# Headers to seem like a browser request. Otherwise, I was encountering issues with the output. 
+headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure the stopwords corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create a request object with the specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context. Code wouldn't run otherwise +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove header and footer from the text""" + + # Find the start + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with space + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words] + + # Count word frequencies + word_counts = {} + for word in words: + if word in word_counts: + word_counts[word] += 1 + else: + word_counts[word] = 1 + + return word_counts + +def main(): + try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + print(clean_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Print word frequencies + print(frequencies) + + # Print any errors that occur + except Exception as e: + + print("An error occurred:", e) + +if __name__ == "__main__": + main() diff --git 
a/word-frequencies/great-gatsby.py b/word-frequencies/great-gatsby.py new file mode 100644 index 0000000..df8559d --- /dev/null +++ b/word-frequencies/great-gatsby.py @@ -0,0 +1,87 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/64317/pg64317.txt' + +# Headers to seem like a browser request. Otherwise, I was encountering issues with the output. +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure the stopwords corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create a request object with the specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context. Code wouldn't run otherwise +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove header and footer from the text""" + + # Find the start + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with space + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words] + + # Count word frequencies + word_counts = {} + for word in words: + if word in word_counts: + word_counts[word] += 1 + else: + word_counts[word] = 1 + + return word_counts + +try: + # Open the URL and read the raw text + with 
urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + print(clean_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Print word frequencies + print(frequencies) + +# Print any errors that occur +except Exception as e: + + print("An error occurred:", e) + diff --git a/word-frequencies/hound-baskervilles b/word-frequencies/hound-baskervilles new file mode 100644 index 0000000..a3ad521 --- /dev/null +++ b/word-frequencies/hound-baskervilles @@ -0,0 +1,89 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords + +nltk.download('stopwords') + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/2852/pg2852.txt' + +# Headers to seem like a browser request. Otherwise, I was encountering issues with the output. +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure the stopwords corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create a request object with the specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context. 
Code wouldn't run otherwise +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove header and footer from the text""" + + # Find the start + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with space + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words] + + # Count word frequencies + word_counts = {} + for word in words: + if word in word_counts: + word_counts[word] += 1 + else: + word_counts[word] = 1 + + return word_counts + +try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + print(clean_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Print word frequencies + print(frequencies) + +# Print any errors that occur +except Exception as e: + + print("An error occurred:", e) + diff --git a/word-frequencies/little-women.py b/word-frequencies/little-women.py new file mode 100644 index 0000000..c2b0890 --- /dev/null +++ b/word-frequencies/little-women.py @@ -0,0 +1,89 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords + +nltk.download('stopwords') + +# URL of the text file from Project Gutenberg +url = 
'https://www.gutenberg.org/cache/epub/37106/pg37106.txt' + +# Headers to seem like a browser request. Otherwise, I was encountering issues with the output. +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure the stopwords corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create a request object with the specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context. Code wouldn't run otherwise +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove header and footer from the text""" + + # Find the start + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end + end = re.search (r"\[Illustration\]", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with space + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words] + + # Count word frequencies + word_counts = {} + for word in words: + if word in word_counts: + word_counts[word] += 1 + else: + word_counts[word] = 1 + + return word_counts + +try: + # Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + print(clean_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Print word frequencies + print(frequencies) + +# Print any errors that occur +except Exception as e: + + 
print("An error occurred:", e) + diff --git a/word-frequencies/peter-pan.py b/word-frequencies/peter-pan.py new file mode 100644 index 0000000..cf1ab27 --- /dev/null +++ b/word-frequencies/peter-pan.py @@ -0,0 +1,89 @@ +import urllib.request +import ssl +import re +import nltk +from nltk.corpus import stopwords + +nltk.download('stopwords') + +# URL of the text file from Project Gutenberg +url = 'https://www.gutenberg.org/cache/epub/16/pg16.txt' + +# Headers to seem like a browser request. Otherwise, I was encountering issues with the output. +headers = { + "User-Agent": "Mozilla/5.0" +} + +# Ensure the stopwords corpus is downloaded +try: + stopwords.words('english') +except LookupError: + nltk.download('stopwords') + +# Create a request object with the specified URL and headers +req = urllib.request.Request(url, headers = headers) + +# Create an unverified SSL context. Code wouldn't run otherwise +context = ssl._create_unverified_context() + +def strip_headers(text): + """Remove header and footer from the text""" + + # Find the start + start = re.search(r"\*\*\* START OF.*PROJECT GUTENBERG EBOOK .* \*\*\*", text, re.IGNORECASE) + if start: + text = text[start.end():] + + # Find the end + end = re.search (r"\*\*\* END OF.*PROJECT GUTENBERG EBOOK.* \*\*\*", text, re.IGNORECASE) + if end: + text = text[:end.start()] + + return text.strip() + +def word_frequencies(text): + """Calculate the frequency of each word in the text, excluding stopwords.""" + + # Convert text to lowercase + text = text.lower() + + # Replace punctuation with space + for char in '.,!?;:()[]{}"\'-_': + text = text.replace(char, ' ') + + # Split text into words + words = text.split() + + # Remove stopwords + stop_words = set(stopwords.words('english')) + words = [word for word in words if word not in stop_words] + + # Count word frequencies + word_counts = {} + for word in words: + if word in word_counts: + word_counts[word] += 1 + else: + word_counts[word] = 1 + + return word_counts + +try: + # 
Open the URL and read the raw text + with urllib.request.urlopen(req, context=context) as f: + raw_text = f.read().decode('utf-8') + # Remove headers and footers from the text + clean_text = strip_headers(raw_text) + print(clean_text) + + # Calculate word frequencies + frequencies = word_frequencies(clean_text) + + # Print word frequencies + print(frequencies) + +# Print any errors that occur +except Exception as e: + + print("An error occurred:", e) +
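One note on the scripts above: the word-frequencies files tally counts with a hand-rolled dictionary, while the summary-statistics files use collections.Counter. The two approaches are interchangeable; a minimal sketch with a toy word list (illustrative only):

```python
from collections import Counter

words = ["whale", "sea", "whale", "ship", "sea", "whale"]

# Manual tally, as in the word-frequencies scripts
word_counts = {}
for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

# Counter builds the same mapping in one call and adds most_common()
counts = Counter(words)

print(dict(counts) == word_counts)   # → True
print(counts.most_common(2))         # → [('whale', 3), ('sea', 2)]
```

This is why the summary-statistics scripts can produce their Top 10 list with a single frequencies.most_common(10) call.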