diff --git a/.vscode/settings.json b/.vscode/settings.json
new file mode 100644
index 0000000..642ff51
--- /dev/null
+++ b/.vscode/settings.json
@@ -0,0 +1,3 @@
+{
+    "python.REPL.enableREPLSmartSend": false
+}
\ No newline at end of file
diff --git a/README.md b/README.md
index 05aa109..aa4cf33 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,33 @@
 # Text-Analysis-Project
-Please read the [instructions](instructions.md).
+1. Project Overview
+
+For this project, I analyzed text from the Wikipedia article "The Neighbourhood (band)". The Neighbourhood is a band I listen to regularly, but I did not know much about its origins or its members. I used the MediaWiki API to fetch the article content, then cleaned and processed the text to study word frequencies, summary statistics, and visual patterns. I also used the NLTK VADER sentiment analyzer to gauge how the Wikipedia page portrays the band overall, since I knew the band had faced some controversy. Finally, I implemented a Markov text generator to create new, band-style text based on the original article. My goal was to practice accessing data programmatically, working with text data structures, and experimenting with basic language models, all while working with data that truly fascinated and resonated with me.
+
+2. Implementation
+
+I wrote the project in Python. I started by using the MediaWiki library to access the article and then saved the data locally using the pickle module so I wouldn’t have to redownload it each time. Once I had the text, I cleaned it by converting everything to lowercase and removing punctuation, numbers, and extra spaces. I also removed common stop words like “the” and “and” using NLTK, which helped me focus on more meaningful words that described the band. After that, I created a dictionary-based word frequency counter to see which words appeared most often, and I wrote a summary function to calculate statistics such as total words, unique words, and average word length.
+
+To make the results easier to understand, I created visualizations using matplotlib and the wordcloud library, which clearly showed that the article focused on the band’s albums, songs, and members. I also used NLTK’s VADER sentiment analyzer to check whether the article described the band in a positive, negative, or neutral way. As a final step, I built a Markov chain text generator that created new sentences based on the band’s Wikipedia page. The generated text wasn’t perfect, but it was fun to see how the model imitated the article’s tone and vocabulary. Throughout the project, I learned a lot about working with APIs, cleaning data, and experimenting with text-based models, all while exploring a topic that genuinely interested me.
+
+3. Results
+
+After cleaning and analyzing the text from the "The Neighbourhood (band)" Wikipedia article, several interesting patterns appeared. The most frequent words were band, released, album, lead, love, single, imagine, and music, which makes sense given how much of the article focuses on the band’s discography and musical style. The bar chart clearly highlighted these recurring terms, while the word cloud visually emphasized the same ideas, showing that the Wikipedia page revolves heavily around their albums, releases, and artistic identity.
+
+The sentiment analysis using NLTK’s VADER tool produced the following results:
+
+neg: 0.019
+neu: 0.911
+pos: 0.071
+compound: 0.9959
+Overall Sentiment: 😊 Positive
+
+These scores indicate that the overall tone of the article is highly positive, with very little negativity. This suggests that the band is presented favorably on Wikipedia, focusing more on their creative output and achievements rather than the controversy it has faced.
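The label above is driven entirely by the compound score. A minimal sketch of the thresholding rule, using the same conventional VADER cutoffs that the script's `analyze_sentiment` function applies:

```python
def overall_sentiment(compound):
    """Map a VADER compound score to a coarse label.

    Conventional VADER cutoffs: >= 0.05 is positive, <= -0.05 is
    negative, and anything in between counts as neutral.
    """
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"

# the article's compound score of 0.9959 sits far above the positive cutoff
print(overall_sentiment(0.9959))  # → Positive
```

Note that with `neu` at 0.911, the high compound score mostly means the relatively few opinionated words in the article skew positive, not that the whole text is emotionally charged.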
+
+For the Markov text synthesis, I generated new text based on the cleaned article. The results were often jumbled but still recognizable, repeating words like “album,” “released,” and “imagine” in random but musically themed combinations. For example, one output read: “Revenge serious august premiered music video released march included tracks previous extended plays including lead single scary love...” Another example produced: “Band april neighbourhood announced summer tour called love collection tour along lovelife jmsn well planning release mixtape december released new ep called hard imagine...” While some of the generated sentences don’t follow clear grammar or structure, they reflect how the Markov model mimics the repetitive phrasing and musical terminology of the original article. This shows how text generation models can capture patterns of language without fully understanding context, producing interesting and sometimes chaotic outputs that still sound like something a band’s Wikipedia page might say.
+
+4. Reflection
+
+Overall, this project went really well and helped me gain a deeper understanding of how to work with text data in Python. I was proud of how I broke the project into smaller functions, which made it easier to test and fix errors along the way. The biggest challenge was cleaning and preprocessing the text correctly, especially when I was first getting random outputs or symbols that didn’t make sense. Another challenge was working in a .py file, since it is a bit different from the Jupyter notebooks I usually use. I had to get more comfortable with the terminal while installing libraries, especially learning the difference between running commands in the Python REPL, PowerShell, and my computer’s regular terminal. At first, I kept getting errors like “module not found,” and I didn’t realize that I was entering commands in the wrong place. Once I learned to run pip install inside the right environment, everything started running smoothly. If I were to improve the project, I’d try to generate text from multiple band articles and compare how their styles differ, which would also let me build comparison visualizations.
+
+From a learning perspective, my biggest takeaway was realizing how much potential text analysis has for exploring culture, music, and media through data. Before this project, I had never used APIs or performed text-based sentiment analysis, so learning how to fetch, clean, and visualize data felt like a big step forward. Using AI tools helped me understand new libraries like MediaWiki, NLTK, and wordcloud faster and guided me when I got stuck. Going forward, I feel more confident about applying Python to real-world datasets, especially when combining data science and creativity. I also learned that debugging patiently and documenting my process through comments and docstrings made the work much smoother because I was able to recall my steps faster. This assignment definitely tested my patience as I fixed bugs and modified the code until it ran the way I intended, and that persistence is something I’ll carry into future projects.
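The bigram chains behind the generated text in Section 3 are easy to see on a toy example. The word list below is made up for illustration (the real chain is built from the cleaned article in textassignment.py), but the chain-building logic mirrors the script's `build_markov_chain`:

```python
import random

def build_chain(words, n=2):
    # map each n-word window to the list of words observed right after it
    chain = {}
    for i in range(len(words) - n):
        chain.setdefault(tuple(words[i:i + n]), []).append(words[i + n])
    return chain

def generate(chain, num_words=8):
    # start at a random state and walk the learned transitions
    state = random.choice(list(chain.keys()))
    out = list(state)
    while len(out) < num_words:
        options = chain.get(tuple(out[-len(state):]))
        if not options:  # dead end: no observed successor for this state
            break
        out.append(random.choice(options))
    return " ".join(out)

# toy corpus with the same repetitive flavour as the cleaned article
words = "band released album band released single band released ep".split()
chain = build_chain(words)
print(chain[("band", "released")])  # → ['album', 'single', 'ep']
print(generate(chain))
```

Because the state `('band', 'released')` has three observed successors, repeated runs produce different but similarly band-flavoured strings, which is exactly the jumbled-but-recognizable quality of the outputs quoted in the Results section.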
diff --git a/textassignment.py b/textassignment.py
new file mode 100644
index 0000000..fe24e5f
--- /dev/null
+++ b/textassignment.py
@@ -0,0 +1,309 @@
+#######################################################################################################
+#starting code for text assignment
+#importing library
+from mediawiki import MediaWiki
+
+#making the pull of text a function so i can reuse later with less lines of code
+def get_text(topic):
+    wikipedia = MediaWiki()
+    page = wikipedia.page(topic)
+    return page.content
+
+if __name__ == "__main__":
+    text = get_text("The Neighbourhood (band)") #one of my fav bands
+    print(text[:1000])
+    #making sure package works by printing first 1000 characters of the text
+
+#pickling the text for later use
+import pickle
+
+#https://wiki.python.org/moin/UsingPickle (used this to help)
+with open("the_neighbourhood_band.pkl", "wb") as f:
+    pickle.dump(text, f)
+#######################################################################################################
+#text cleaning and preprocessing
+import re
+import string
+
+def clean_text(text):
+    """cleans and preprocesses text for analysis"""
+    # lowercase the text
+    text = text.lower()
+
+    # remove punctuation
+    text = text.translate(str.maketrans('', '', string.punctuation))
+
+    # remove numbers and other symbols using regex
+    text = re.sub(r'\d+', '', text)
+
+    # remove extra whitespace
+    text = re.sub(r'\s+', ' ', text).strip()
+
+    # split into words (tokenization, learned from machine learning class for easier data manipulation of text)
+    words = text.split()
+
+    return words
+
+# testing the function
+words = clean_text(text)
+print(words[:30]) # first 30 cleaned words
+#######################################################################################################
+#removing stop words
+#importing nltk package (used https://www.nltk.org/book/ch01.html as reference)
+
+import nltk
+from nltk.corpus import stopwords
+
+nltk.download('stopwords')
+
+def remove_stopwords(words):
+    """removes common stop words from a list of words"""
+    stop_words = set(stopwords.words('english'))
+    filtered = [w for w in words if w not in stop_words]
+    return filtered
+
+# testing the function
+words = clean_text(text)
+print("Before stopword removal:", len(words), "words")
+
+filtered_words = remove_stopwords(words)
+print("After stopword removal:", len(filtered_words), "words")
+print(filtered_words[:30])
+#######################################################################################################
+#word frequency analysis
+#using a dictionary function to do this
+def word_frequency(words):
+    """count word frequencies using a dictionary"""
+    freq = {} # empty dictionary
+
+    for word in words:
+        if word in freq:
+            freq[word] += 1 # if word exists, increment count
+        else:
+            freq[word] = 1 # if new word, set count to 1
+
+    return freq
+
+freq = word_frequency(filtered_words)
+
+#print all word freqs
+for word, count in freq.items():
+    print(f"{word}: {count}")
+
+######################################################################################################
+#summary statistics
+
+def summary_statistics(text, words, freq):
+    """compute and print key summary statistics for the text"""
+
+    # document length (in words)
+    total_words = len(words)
+
+    # unique words
+    unique_words = len(set(words))
+
+    # vocabulary richness (unique / total)
+    vocab_richness = unique_words / total_words if total_words > 0 else 0
+
+    # average word length
+    avg_word_length = sum(len(word) for word in words) / total_words if total_words > 0 else 0
+
+    # sentence stats (split text by punctuation)
+    sentences = re.split(r'[.!?]+', text)
+    sentences = [s.strip() for s in sentences if s.strip() != ""]
+    total_sentences = len(sentences)
+    avg_sentence_length = total_words / total_sentences if total_sentences > 0 else 0
+
+    # top 10 frequent words
+    sorted_freq = dict(sorted(freq.items(), key=lambda item: item[1], reverse=True))
+    top_10 = list(sorted_freq.items())[:10]
+
+    #printing all results
+    print("\nthe neighbourhood text stats:")
+    print(f"Total words (after cleaning): {total_words}")
+    print(f"Unique words: {unique_words}")
+    print(f"Vocabulary richness: {vocab_richness:.3f}")
+    print(f"Average word length: {avg_word_length:.2f}")
+    print(f"Total sentences: {total_sentences}")
+    print(f"Average sentence length (words): {avg_sentence_length:.2f}")
+    print("\nTop 10 most frequent words:")
+    for word, count in top_10:
+        print(f" {word}: {count}")
+
+#calling function to print summary stats
+summary_statistics(text, filtered_words, freq)
+######################################################################################################
+#data visualization
+
+import matplotlib.pyplot as plt
+from wordcloud import WordCloud
+
+def plot_top_words(freq, n=10):
+    """
+    visualize the top 10 most frequent words as a bar chart.
+
+    learned from:
+    https://matplotlib.org/stable/tutorials/introductory/pyplot.html
+    and https://www.w3schools.com/python/matplotlib_intro.asp
+    and used ai agent help to format code properly
+    """
+    # sort the frequency dictionary by value (descending order)
+    sorted_freq = dict(sorted(freq.items(), key=lambda item: item[1], reverse=True))
+    top_items = list(sorted_freq.items())[:n]
+    words, counts = zip(*top_items)
+
+    # create bar chart
+    plt.figure(figsize=(10, 6))
+    plt.bar(words, counts, color='skyblue')
+    plt.title(f"Top {n} Most Frequent Words")
+    plt.xlabel("Words")
+    plt.ylabel("Frequency")
+    plt.xticks(rotation=45, ha="right")
+    plt.tight_layout()
+    plt.show()
+
+
+def make_wordcloud(freq):
+    """
+    Generate and display a word cloud from word frequencies.
+
+    Technique inspired by:
+    https://amueller.github.io/word_cloud/
+    and https://www.geeksforgeeks.org/generating-word-cloud-python/
+    """
+    wc = WordCloud(
+        width=800,
+        height=400,
+        background_color='white',
+        max_words=200
+    ).generate_from_frequencies(freq)
+
+    plt.figure(figsize=(10, 6))
+    plt.imshow(wc, interpolation='bilinear')
+    plt.axis("off")
+    plt.title("Word Cloud of Most Frequent Words", fontsize=16)
+    plt.show()
+
+#calling functions to show chart outputs
+plot_top_words(freq, n=10)
+make_wordcloud(freq)
+#######################################################################################################
+#advanced technique - vader from nltk and markov
+
+#used these sources to help with vader sentiment analysis
+#https://www.nltk.org/_modules/nltk/sentiment/vader.html
+#https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/
+
+#technique 1 : sentiment analysis with vader
+import nltk
+nltk.download('vader_lexicon')
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+
+def analyze_sentiment(text):
+    """analyze sentiment of the text using NLTK's VADER."""
+    sia = SentimentIntensityAnalyzer()
+    scores = sia.polarity_scores(text)
+
+    print("\nsentiment analysis results:")
+    for k, v in scores.items():
+        print(f"{k}: {v}")
+
+    # basic interpretation
+    if scores["compound"] >= 0.05:
+        print("\nOverall Sentiment: 😊 Positive")
+    elif scores["compound"] <= -0.05:
+        print("\nOverall Sentiment: 😠 Negative")
+    else:
+        print("\nOverall Sentiment: 😐 Neutral")
+
+#calling the function to analyze sentiment
+analyze_sentiment(text)
+
+#technique 2 : markov text generation
+
+# sources used
+# "Think Python" by Allen B. Downey (Markov example)
+# Real Python Markov Tutorial: https://realpython.com/markov-chains/
+# used chatgpt to explain and help with this one
+
+
+import random
+
+def build_markov_chain(words, n=2):
+    """
+    build a Markov chain as a dictionary mapping tuples of 'n' words to possible next words
+    """
+    markov_chain = {}
+    for i in range(len(words) - n):
+        key = tuple(words[i:i + n])
+        next_word = words[i + n]
+        if key not in markov_chain:
+            markov_chain[key] = []
+        markov_chain[key].append(next_word)
+    return markov_chain
+
+
+def generate_markov_text(chain, num_words=100):
+    """
+    generate new text using the Markov chain.
+    randomly starts from one of the keys and keeps selecting next words based on the learned transitions.
+    """
+
+    start_key = random.choice(list(chain.keys()))
+    result = list(start_key)
+
+    for _ in range(num_words - len(start_key)):
+        next_words = chain.get(tuple(result[-len(start_key):]), None)
+        if not next_words:
+            break
+        next_word = random.choice(next_words)
+        result.append(next_word)
+
+    return ' '.join(result)
+
+#printing output
+print("\n generating markov text \n")
+chain = build_markov_chain(filtered_words, n=2)
+generated_text = generate_markov_text(chain, num_words=100)
+print(generated_text)
+#######################################################################################################
+#main function to run all parts of the project
+def main():
+    # get Wikipedia text
+    text = get_text("The Neighbourhood (band)") # one of my fav bands
+    print(text[:1000]) # sanity check
+
+    # pickle text for later use
+    with open("the_neighbourhood_band.pkl", "wb") as f:
+        pickle.dump(text, f)
+
+    # clean and preprocess
+    words = clean_text(text)
+    print(words[:30])
+
+    # remove stopwords
+    print("Before stopword removal:", len(words))
+    filtered_words = remove_stopwords(words)
+    print("After stopword removal:", len(filtered_words))
+    print(filtered_words[:30])
+
+    # word frequency + summary stats
+    freq = word_frequency(filtered_words)
+    summary_statistics(text, filtered_words, freq)
+
+    # visualizations
+    plot_top_words(freq, n=10)
+    make_wordcloud(freq)
+
+    # sentiment analysis
+    analyze_sentiment(text)
+
+    # markov text synthesis
+    print("\n--- Generating Markov Text ---\n")
+    chain = build_markov_chain(filtered_words, n=2)
+    generated_text = generate_markov_text(chain, num_words=100)
+    print(generated_text)
+
+# run the project
+if __name__ == "__main__":
+    main()
diff --git a/the_neighbourhood_band.pkl b/the_neighbourhood_band.pkl
new file mode 100644
index 0000000..d48b1fc
Binary files /dev/null and b/the_neighbourhood_band.pkl differ