3 changes: 3 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,3 @@
{
    "python.REPL.enableREPLSmartSend": false
}
32 changes: 31 additions & 1 deletion README.md
@@ -1,3 +1,33 @@
# Text-Analysis-Project

Please read the [instructions](instructions.md).
1. Project Overview

For this project, I analyzed text from the Wikipedia article "The Neighbourhood (band)", a band I listen to regularly but knew little about, including its origins and its members. I used the MediaWiki API to fetch the article content, then cleaned and processed the text to study word frequencies, summary statistics, and visual patterns. I also used NLTK's VADER sentiment analyzer to gauge how the Wikipedia page presents the band overall, since I knew there had been some controversy around them. Finally, I implemented a Markov text generator to create new, band-style text based on the original article. My goal was to practice accessing data programmatically, working with text data structures, and experimenting with basic language models, all while working with data that truly fascinated and resonated with me.

2. Implementation

For this project, I used Python to analyze text from the Wikipedia article "The Neighbourhood (band)". I started by using the MediaWiki library to access the article, then saved the data locally with the pickle module so I wouldn't have to redownload it each time. Once I had the text, I cleaned it by converting everything to lowercase and removing punctuation, numbers, and extra spaces. I also removed common stop words like "the" and "and" using NLTK, which helped me focus on more meaningful words that described the band. After that, I created a dictionary-based word frequency counter to see which words appeared most often, and I wrote a summary function to calculate things like total words, unique words, and average word length.

To make the results easier to understand, I created visualizations using matplotlib and the wordcloud library, which clearly showed that the article focused on the band’s albums, songs, and members. I also used NLTK’s VADER sentiment analyzer to check if the article described the band in a positive, negative, or neutral way. As a final step, I built a Markov chain text generator that created new sentences based on the band’s Wikipedia page. The generated text wasn’t perfect, but it was fun to see how the model imitated the article’s tone and vocabulary. Throughout the project, I learned a lot about working with APIs, cleaning data, and experimenting with text-based models, all while exploring a topic that genuinely interested me.

3. Results

After cleaning and analyzing the text from the "The Neighbourhood (band)" Wikipedia article, several interesting patterns appeared. The most frequent words were band, released, album, lead, love, single, imagine, and music, which makes sense given how much of the article focuses on the band’s discography and musical style. The bar chart clearly highlighted these recurring terms, while the word cloud visually emphasized the same ideas, showing that the Wikipedia page revolves heavily around their albums, releases, and artistic identity.

The sentiment analysis using NLTK’s VADER tool produced the following results:

neg: 0.019
neu: 0.911
pos: 0.071
compound: 0.9959
Overall Sentiment: 😊 Positive

These scores indicate that the overall tone of the article is highly positive, with very little negativity. This suggests that the band is presented favorably on Wikipedia, with the focus on their creative output and achievements rather than the controversy they have faced.

For the Markov text synthesis, I generated new text based on the cleaned article. The results were often jumbled but still recognizable, repeating words like “album,” “released,” and “imagine” in random but musically themed combinations. For example, one output read: “Revenge serious august premiered music video released march included tracks previous extended plays including lead single scary love...” Another example produced: “Band april neighbourhood announced summer tour called love collection tour along lovelife jmsn well planning release mixtape december released new ep called hard imagine...” While some of the generated sentences don’t follow clear grammar or structure, they reflect how the Markov model mimics the repetitive phrasing and musical terminology from the original article. This shows how text generation models can capture patterns of language without fully understanding context, producing interesting and sometimes chaotic outputs that still sound like something a band’s Wikipedia page might say.

4. Reflection

Overall, this project went really well and helped me gain a deeper understanding of how to work with text data in Python. I was proud of how I broke the project into smaller functions, which made it easier to test and fix errors along the way. The biggest challenge was cleaning and preprocessing the text correctly, especially when I was first getting random outputs or symbols that didn’t make sense. Another challenge was working in a .py file, which is a bit different from the Jupyter notebooks I usually use. I had to get more comfortable with the terminal while installing libraries, especially learning the difference between running commands in the Python REPL, PowerShell, and my computer’s regular terminal. At first, I kept getting errors like “module not found,” and I didn’t realize that I was entering commands in the wrong place. Once I learned to use pip install correctly inside the right environment, everything started running smoothly. If I were to improve the project, I’d try to generate text from multiple band articles and compare how their styles differ. This would also allow me to build comparison visualizations.

From a learning perspective, my biggest takeaway was realizing how much potential text analysis has for exploring culture, music, and media through data. Before this project, I had never used APIs or performed text-based sentiment analysis, so learning how to fetch, clean, and visualize data felt like a big step forward. Using AI tools helped me understand new libraries like MediaWiki, NLTK, and wordcloud faster and guided me when I got stuck. Going forward, I feel more confident about applying Python to real-world datasets, especially when combining data science and creativity. I also learned that debugging patiently and documenting my process through comments and docstrings made the work much smoother because I was able to recall my steps faster. This assignment definitely tested my patience with the bugs I had to fix and the code I had to modify so it ran how I intended, and that is something I’ll carry into future projects.
309 changes: 309 additions & 0 deletions textassignment.py
@@ -0,0 +1,309 @@
#######################################################################################################
#starting code for text assignment
#importing library
from mediawiki import MediaWiki

#making the pull of text a function so i can reuse later with less lines of code
def get_text(topic):
    wikipedia = MediaWiki()
    page = wikipedia.page(topic)
    return page.content

if __name__ == "__main__":
    text = get_text("The Neighbourhood (band)")  # one of my fav bands
    print(text[:1000])
    # making sure package works by printing first 1000 characters of the text

# pickling the text for later use
import pickle

# https://wiki.python.org/moin/UsingPickle (used this to help)
with open("the_neighbourhood_band.pkl", "wb") as f:
    pickle.dump(text, f)
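The README mentions pickling so the article doesn't have to be redownloaded each time, but the script above always fetches fresh and only ever writes the cache. A minimal cache-or-fetch helper could close that loop; this is a sketch, and `load_or_fetch` is a hypothetical name, not part of the original code:

```python
import os
import pickle

def load_or_fetch(cache_file, fetch):
    """Load text from a local pickle cache if it exists; otherwise fetch and cache it."""
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)  # cache hit: skip the network entirely
    text = fetch()  # e.g. lambda: get_text("The Neighbourhood (band)")
    with open(cache_file, "wb") as f:
        pickle.dump(text, f)  # cache miss: store for next run
    return text
```

On the second run the pickle exists, so the MediaWiki request is skipped.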
#######################################################################################################
#text cleaning and preprocessing
import re
import string

def clean_text(text):
    """cleans and preprocesses text for analysis"""
    # lowercase the text
    text = text.lower()

    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # remove numbers and other symbols using regex
    text = re.sub(r'\d+', '', text)

    # remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # split into words (tokenization, learned from machine learning class for easier data manipulation of text)
    words = text.split()

    return words

# testing the function
words = clean_text(text)
print(words[:30]) # first 30 cleaned words
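As a quick illustration of what the cleaning pipeline does, here is the same sequence of steps applied to a toy sentence (the function body is repeated so the snippet runs on its own):

```python
import re
import string

def clean_text(text):
    """lowercase, strip punctuation and digits, collapse whitespace, split into words"""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text.split()

print(clean_text("The Neighbourhood, formed in 2011!"))
# -> ['the', 'neighbourhood', 'formed', 'in']
```

Note that the year disappears entirely: punctuation is removed first, then digits, then the leftover whitespace is collapsed.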
#######################################################################################################
#removing stop words
#importing nltk package (used https://www.nltk.org/book/ch01.html as reference)

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def remove_stopwords(words):
    """removes common stop words from a list of words"""
    stop_words = set(stopwords.words('english'))
    filtered = [w for w in words if w not in stop_words]
    return filtered

# testing the function
words = clean_text(text)
print("Before stopword removal:", len(words), "words")

filtered_words = remove_stopwords(words)
print("After stopword removal:", len(filtered_words), "words")
print(filtered_words[:30])
#######################################################################################################
#word frequency analysis
#using a dictionary function to do this
def word_frequency(words):
    """count word frequencies using a dictionary"""
    freq = {}  # empty dictionary

    for word in words:
        if word in freq:
            freq[word] += 1  # if word exists, increment count
        else:
            freq[word] = 1  # if new word, set count to 1

    return freq

freq = word_frequency(filtered_words)

#print all word freqs
for word, count in freq.items():
    print(f"{word}: {count}")
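The hand-rolled dictionary counter above works fine; for reference, the standard library's `collections.Counter` produces the same mapping and adds a `most_common` helper that would also cover the top-10 sorting done later:

```python
from collections import Counter

sample = ["album", "released", "album", "band", "album", "released"]
freq_counter = Counter(sample)      # same counts as the dictionary loop above
print(freq_counter.most_common(2))  # -> [('album', 3), ('released', 2)]
```

Keeping the explicit loop is a fine choice for learning; `Counter` is just the idiomatic shortcut.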

######################################################################################################
#summary statistics

def summary_statistics(text, words, freq):
    """compute and print key summary statistics for the text"""

    # document length (in words)
    total_words = len(words)

    # unique words
    unique_words = len(set(words))

    # vocabulary richness (unique / total)
    vocab_richness = unique_words / total_words if total_words > 0 else 0

    # average word length
    avg_word_length = sum(len(word) for word in words) / total_words if total_words > 0 else 0

    # sentence stats (split text by punctuation)
    sentences = re.split(r'[.!?]+', text)
    sentences = [s.strip() for s in sentences if s.strip() != ""]
    total_sentences = len(sentences)
    avg_sentence_length = total_words / total_sentences if total_sentences > 0 else 0

    # top 10 frequent words
    sorted_freq = dict(sorted(freq.items(), key=lambda item: item[1], reverse=True))
    top_10 = list(sorted_freq.items())[:10]

    # printing all results
    print("\nthe neighbourhood text stats:")
    print(f"Total words (after cleaning): {total_words}")
    print(f"Unique words: {unique_words}")
    print(f"Vocabulary richness: {vocab_richness:.3f}")
    print(f"Average word length: {avg_word_length:.2f}")
    print(f"Total sentences: {total_sentences}")
    print(f"Average sentence length (words): {avg_sentence_length:.2f}")
    print("\nTop 10 most frequent words:")
    for word, count in top_10:
        print(f"  {word}: {count}")

#calling function to print summary stats
summary_statistics(text, filtered_words, freq)
######################################################################################################
#data visualization

import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_top_words(freq, n=10):
    """
    visualize the top 10 most frequent words as a bar chart.

    learned from:
    https://matplotlib.org/stable/tutorials/introductory/pyplot.html
    and https://www.w3schools.com/python/matplotlib_intro.asp
    and used ai agent help to format code properly
    """
    # sort the frequency dictionary by value (descending order)
    sorted_freq = dict(sorted(freq.items(), key=lambda item: item[1], reverse=True))
    top_items = list(sorted_freq.items())[:n]
    words, counts = zip(*top_items)

    # create bar chart
    plt.figure(figsize=(10, 6))
    plt.bar(words, counts, color='skyblue')
    plt.title(f"Top {n} Most Frequent Words")
    plt.xlabel("Words")
    plt.ylabel("Frequency")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()


def make_wordcloud(freq):
    """
    Generate and display a word cloud from word frequencies.

    Technique inspired by:
    https://amueller.github.io/word_cloud/
    and https://www.geeksforgeeks.org/generating-word-cloud-python/
    """
    wc = WordCloud(
        width=800,
        height=400,
        background_color='white',
        max_words=200
    ).generate_from_frequencies(freq)

    plt.figure(figsize=(10, 6))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.title("Word Cloud of Most Frequent Words", fontsize=16)
    plt.show()

#calling functions to show chart outputs
plot_top_words(freq, n=10)
make_wordcloud(freq)
#######################################################################################################
#advanced technique - vader from nltk and markov

#used these sources to help with vader sentiment analysis
#https://www.nltk.org/_modules/nltk/sentiment/vader.html
#https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/

#technique 1 : sentiment analysis with vader
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def analyze_sentiment(text):
    """analyze sentiment of the text using NLTK's VADER."""
    sia = SentimentIntensityAnalyzer()
    scores = sia.polarity_scores(text)

    print("\nsentiment analysis results:")
    for k, v in scores.items():
        print(f"{k}: {v}")

    # basic interpretation
    if scores["compound"] >= 0.05:
        print("\nOverall Sentiment: 😊 Positive")
    elif scores["compound"] <= -0.05:
        print("\nOverall Sentiment: 😠 Negative")
    else:
        print("\nOverall Sentiment: 😐 Neutral")

#calling the function to analyze sentiment
analyze_sentiment(text)

#technique 2 : markov text generation

# sources used
# "Think Python" by Allen B. Downey (Markov example)
# Real Python Markov Tutorial: https://realpython.com/markov-chains/
# used chatgpt to explain and help with this one


import random

def build_markov_chain(words, n=2):
    """
    build a Markov chain as a dictionary mapping tuples of 'n' words to possible next words
    """
    markov_chain = {}
    for i in range(len(words) - n):
        key = tuple(words[i:i + n])
        next_word = words[i + n]
        if key not in markov_chain:
            markov_chain[key] = []
        markov_chain[key].append(next_word)
    return markov_chain


def generate_markov_text(chain, num_words=100):
    """
    generate new text using the Markov chain.
    randomly starts from one of the keys and keeps selecting next words based on the learned transitions.
    """
    start_key = random.choice(list(chain.keys()))
    result = list(start_key)

    for _ in range(num_words - len(start_key)):
        next_words = chain.get(tuple(result[-len(start_key):]), None)
        if not next_words:
            break
        next_word = random.choice(next_words)
        result.append(next_word)

    return ' '.join(result)

#printing output
print("\n generating markov text \n")
chain = build_markov_chain(filtered_words, n=2)
generated_text = generate_markov_text(chain, num_words=100)
print(generated_text)
#######################################################################################################
#main function to run all parts of the project
def main():
    # get Wikipedia text
    text = get_text("The Neighbourhood (band)")  # one of my fav bands
    print(text[:1000])  # sanity check

    # pickle text for later use
    with open("the_neighbourhood_band.pkl", "wb") as f:
        pickle.dump(text, f)

    # clean and preprocess
    words = clean_text(text)
    print(words[:30])

    # remove stopwords
    print("Before stopword removal:", len(words))
    filtered_words = remove_stopwords(words)
    print("After stopword removal:", len(filtered_words))
    print(filtered_words[:30])

    # word frequency + summary stats
    freq = word_frequency(filtered_words)
    summary_statistics(text, filtered_words, freq)

    # visualizations
    plot_top_words(freq, n=10)
    make_wordcloud(freq)

    # sentiment analysis
    analyze_sentiment(text)

    # markov text synthesis
    print("\n--- Generating Markov Text ---\n")
    chain = build_markov_chain(filtered_words, n=2)
    generated_text = generate_markov_text(chain, num_words=100)
    print(generated_text)

# run the project
if __name__ == "__main__":
    main()
Binary file added the_neighbourhood_band.pkl
Binary file not shown.