Binary file added Apple_Inc._top10.png
Binary file added Apple_Inc._wordcloud.png
Binary file added Microsoft_top10.png
Binary file added Microsoft_wordcloud.png
44 changes: 44 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,47 @@
# Text-Analysis-Project

Please read the [instructions](instructions.md).

Project Overview:

For this project, I did my best to apply the TF-IDF methods I learned in my Machine Learning class. Although I had a solid grasp of tokenization and related concepts, I had trouble implementing those methods in Python rather than R Studio, and AI helped me tremendously with that transition.
For my main data source I collected Wikipedia articles through the MediaWiki API via the mediawiki Python library. I focused on Apple, Microsoft, and Samsung so I could compare the most frequently used terms in each article and identify the top TF-IDF keywords unique to each company, since I wanted to find the differences in focus between companies within the same industry.

Implementation:
I used multiple Python files:
- fetch.py (Uses MediaWiki to fetch article text for each topic.)
- clean.py (Cleans and tokenizes text by removing punctuation, converting to lowercase, and removing stop words)
- tfidf.py (Calculates TF, IDF, and TF-IDF scores using dictionaries and counters)
- visualize.py (Generates bar charts of the top keywords and word clouds to visualize the most important terms)
- main.py (Integrates all of the files mentioned above and saves the output graphs in a figures/ folder)

![alt text](image-1.png)

I applied the term frequency–inverse document frequency (TF-IDF) method to identify which words were most unique to each company's article. Term frequency (TF) captured how often a word appeared in one document, while inverse document frequency (IDF) measured how rare it was across all three documents. Multiplying the two gave each term a weight that emphasized words distinctive to a specific company. I then visualized the results with matplotlib bar charts and word clouds, saving each plot to the figures/ folder.
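The weighting described above can be sketched in a few lines. The toy documents and helper names below are illustrative only; the sketch uses the same smoothed IDF formula as the project's tfidf.py, log((N + 1) / (df + 1)) + 1.

```python
import math
from collections import Counter

# Two toy "articles" standing in for the Wikipedia text
docs = {
    "apple": ["iphone", "mac", "design", "design"],
    "microsoft": ["windows", "azure", "design"],
}

def tf(tokens):
    # term frequency: share of the document each term accounts for
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def idf(all_docs):
    # document frequency: in how many documents each term appears
    n = len(all_docs)
    df = Counter()
    for tokens in all_docs:
        df.update(set(tokens))
    # smoothed IDF, matching tfidf.py
    return {t: math.log((n + 1) / (c + 1)) + 1 for t, c in df.items()}

idf_scores = idf(docs.values())
tfidf = {name: {t: w * idf_scores[t] for t, w in tf(tokens).items()}
         for name, tokens in docs.items()}

# "design" appears in both documents, so it gets a lower IDF weight
# than terms unique to one document, like "iphone" or "azure".
```

Running this shows "azure" outranking the shared term "design" in the Microsoft toy document, which is exactly the effect the project relies on to surface company-specific keywords.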

A key design decision I faced was whether to analyze the text using TF-IDF or a sentiment analysis approach with libraries like NLTK. I chose TF-IDF because I was more familiar with it than NLTK.

One issue I ran into, which made me glad I kept checking my results, was this (![alt text](image.png)): I had to remove the possessive from Samsung's, or TF-IDF would rank a near-duplicate token that doesn't add much.
Throughout the development of this project, I used ChatGPT to help debug TF-IDF weighting logic, manage file paths, and structure my functions cleanly.
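The possessive cleanup mentioned above can be checked in isolation; this pattern mirrors the one clean.py applies after normalizing curly apostrophes (the sample sentence is made up):

```python
import re

# Strip possessive endings so "samsung's" counts as the same token as "samsung"
possessive = re.compile(r"\b(\w+)'s\b")

sample = "samsung's galaxy outsold samsung phones"
cleaned = possessive.sub(r"\1", sample)
print(cleaned)  # samsung galaxy outsold samsung phones
```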


Results:
Apple Top 10 TF-IDF ![alt text](image-2.png)
Microsoft Top 10 TF-IDF ![alt text](../figures/Microsoft_top10.png)
Samsung Top 10 TF-IDF ![alt text](../figures/Samsung_Electronics_top10.png)

Apple WordCloud ![alt text](image-3.png)
Microsoft WordCloud ![alt text](../figures/Microsoft_wordcloud.png)
Samsung WordCloud ![alt text](../figures/Samsung_Electronics_wordcloud.png)

For Apple, top keywords like apple, jobs, macintosh, and iphone highlight the company’s focus on product design and its strong link to co-founders Steve Jobs and Steve Wozniak.
Microsoft stood out with terms like windows, azure, xbox, and gates, showing its history in software, gaming, and cloud computing.
For Samsung, words such as galaxy, semiconductor, and design reflect its strength in hardware, electronics, and large-scale production.

Overall, the graphs and word clouds show that each company’s Wikipedia article reflects its core identity — Apple with innovation and products, Microsoft with software and cloud systems, and Samsung with manufacturing and technology.

Reflection:

When starting this project, I knew I would use AI a lot, but I also knew the code might not always work when I asked it to do something. So before I started, I decided to be very direct and specific about what I wanted and created a step-by-step process to achieve it: fetching first, then cleaning, and so on, which made the process much smoother. Initially I wanted to conduct sentiment analysis on Amazon products, since it would help users find the most loved products. I wish I were better at organizing multiple files and functions across different modules, since it felt confusing at first.
My biggest takeaway is that I should have a step-by-step plan when writing code. Just like how we need a docstring for functions, we need a plan for a project.
Overall, this project taught me not only about text analytics, but also about writing cleaner, modular Python code that’s easy to maintain and scale.


Binary file added Samsung_Electronics_top10.png
Binary file added Samsung_Electronics_wordcloud.png
74 changes: 74 additions & 0 deletions clean.py
@@ -0,0 +1,74 @@
# clean the fetched data

import unicodedata
import re
import nltk
from nltk.corpus import stopwords


def split_line(line: str):  # Normalizes hyphens/dashes to spaces so tokenization is easier
    return line.replace('-', ' ').replace('–', ' ').replace('—', ' ').split()

def find_punctuation(text: str) -> str:  # Collects every punctuation character that appears in the text
    punc_marks = {}
    for char in text:
        if unicodedata.category(char).startswith("P"):
            punc_marks[char] = 1
    return "".join(punc_marks)

def clean_word(word: str, punctuation: str):
return word.strip(punctuation).lower()

POSSESSIVE_RE = re.compile(r"\b(\w+)'s\b")    # samsung's -> samsung
TRAILING_APOS_RE = re.compile(r"\b(\w+)'\b")  # companies' -> companies
URL_RE = re.compile(r'(https?://\S+|www\.[^\s]+)', re.IGNORECASE)
DIGITS_RE = re.compile(r'\d+')
MULTISPACE_RE = re.compile(r'\s+')


def clean_basic(text: str) -> str:
    """Remove URLs, possessives, and digits, then collapse whitespace.
    I am familiar with the reason we needed to preprocess and clean the data,
    and AI helped suggest a way to do it."""
    # remove URLs
    text = URL_RE.sub(' ', text)
    # normalize curly apostrophes first so the possessive regexes match consistently
    text = text.replace("’", "'").replace("‘", "'")
    # remove possessive endings (e.g., samsung's -> samsung)
    text = POSSESSIVE_RE.sub(r"\1", text)
    # remove plural possessives (e.g., companies' -> companies)
    text = TRAILING_APOS_RE.sub(r"\1", text)
    # remove digits and numbers (e.g., 2024, 10,000)
    text = DIGITS_RE.sub(' ', text)
    # collapse runs of spaces/newlines/tabs into a single space
    text = MULTISPACE_RE.sub(' ', text).strip()
    return text

def clean_then_strip_punct_and_lower(text: str) -> list[str]: # Tokenizes Words
cleaned = clean_basic(text)
punctuation = find_punctuation(cleaned)
words = []
for line in cleaned.splitlines():
for w in split_line(line):
w = clean_word(w, punctuation)
if w:
words.append(w)
return words
# download once (safe to leave here)
nltk.download('stopwords', quiet=True)

# base English stopwords
STOP = set(stopwords.words('english'))

# optional: domain-specific stopwords for Wikipedia/company pages (AI helped generate this idea to me since it made sense to get rid of generic words)
EXTRA_STOP = {
'inc', 'ltd', 'co', 'corp', 'corporation', 'company',
'electronics', 'technology', 'software', 'services',
'usa', 'us', 'u', 's', '===','==','=', # sometimes appear after punctuation removal and AI helped me get rid of these words by suggesting this method
}
STOP |= EXTRA_STOP

def remove_stopwords(tokens: list[str]) -> list[str]:
"""Filter out stop words and very short leftovers."""
return [t for t in tokens if t not in STOP and len(t) > 2]
39 changes: 39 additions & 0 deletions fetch.py
@@ -0,0 +1,39 @@
# fetch.py
# Fetch multiple Wikipedia articles by title and return {title: content}

from mediawiki import MediaWiki
from mediawiki.exceptions import DisambiguationError, PageError

def fetch_articles(titles: list[str]) -> dict[str, str]:
wikipedia = MediaWiki()
articles: dict[str, str] = {}

for title in titles:
try:
# normal fetch
page = wikipedia.page(title)
articles[title] = page.content

except DisambiguationError as e:
# if ambiguous, try the first suggested option
try:
alt = e.options[0]
page = wikipedia.page(alt)
articles[title] = page.content
print(f"⚠️ '{title}' was ambiguous; used '{alt}'.")
except Exception as e2:
print(f"⚠️ Could not fetch {title} (disambiguation): {e2}")

except PageError as e:
print(f"⚠️ Could not fetch {title}: {e}")

except Exception as e:
print(f"⚠️ Could not fetch {title}: {e}")

return articles

if __name__ == "__main__":
# tiny self-test
data = fetch_articles(["Tesla"])
for k in data:
print("Fetched:", k, "chars:", len(data[k]))
65 changes: 65 additions & 0 deletions main.py
@@ -0,0 +1,65 @@
from fetch import fetch_articles
from clean import clean_then_strip_punct_and_lower, remove_stopwords
from tfidf import compute_idf, compute_tfidf
from visualize import plot_top_keywords, generate_wordcloud
from collections import Counter
import os

def compute_word_frequency(tokens):
"""Count raw word frequencies for a document."""
return Counter(tokens)

def main():
    topics = ["Apple Inc.", "Microsoft", "Samsung Electronics"]
articles = fetch_articles(topics)

cleaned_docs = {}

# Step 1: Clean & preprocess
for title, content in articles.items():
tokens = clean_then_strip_punct_and_lower(content)
tokens_no_stop = remove_stopwords(tokens)
cleaned_docs[title] = tokens_no_stop

# Step 2a: Word frequency (raw counts) BEFORE TF-IDF
for title, tokens in cleaned_docs.items():
word_freq = compute_word_frequency(tokens) # Counter
top_words = word_freq.most_common(10)
print(f"\n--- {title}: Top 10 Frequent Words ---")
for word, freq in top_words:
print(f"{word:<15} {freq}")

# Step 2b: Compute IDF across all documents
idf = compute_idf(cleaned_docs.values())

outdir = "figures"
os.makedirs(outdir, exist_ok=True)

# Step 3: Compute TF-IDF for each document
for title, tokens in cleaned_docs.items():
tfidf_scores = compute_tfidf(tokens, idf)

# get top 10 keywords by TF-IDF score
top_keywords = sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True)[:10]

print(f"{title}: TF-IDF terms count =", len(tfidf_scores))
print(f"\n--- {title} ---")

for word, score in top_keywords:
print(f"{word:<20} {score:.4f}")
print("-" * 60)

# Visualize
plot_top_keywords(
tfidf_scores, title, n=10,
save_path=f"{outdir}/{title.replace(' ', '_')}_top10.png"
)
generate_wordcloud(
tfidf_scores, title,
save_path=f"{outdir}/{title.replace(' ', '_')}_wordcloud.png"
)

if __name__ == "__main__":
main()
26 changes: 26 additions & 0 deletions tfidf.py
@@ -0,0 +1,26 @@
import math
from collections import Counter, defaultdict

def compute_tf(tokens): # AI/Chat helped me perform the calculations for TF and IDF, which was taught to me in my machine learning class
"""Compute term frequency for one document."""
tf = Counter(tokens)
total = sum(tf.values())
return {term: freq / total for term, freq in tf.items()}

def compute_idf(all_docs):
"""Compute inverse document frequency from multiple docs."""
N = len(all_docs)
idf = defaultdict(int)

# count in how many docs each term appears
for doc in all_docs:
for term in set(doc):
idf[term] += 1

# compute IDF
return {term: math.log((N + 1) / (df + 1)) + 1 for term, df in idf.items()}

def compute_tfidf(tokens, idf):
"""Compute TF-IDF for one document given its IDF dictionary."""
tf = compute_tf(tokens)
return {term: tf_val * idf.get(term, 0) for term, tf_val in tf.items()}
51 changes: 51 additions & 0 deletions visualize.py
@@ -0,0 +1,51 @@
# visualize.py
import os
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_top_keywords(tfidf_scores, title, n=10, save_path=None):
    """Bar chart of the top n TF-IDF terms. If save_path is given, saves a PNG instead of showing.
    My experience with TF-IDF was in R, so I needed a lot of help from AI to perform this task."""

top_items = sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True)[:n]
if not top_items:
return
words, scores = zip(*top_items)

plt.figure(figsize=(8, 4))
plt.barh(words, scores)
plt.gca().invert_yaxis()
plt.title(f"Top {n} TF-IDF Keywords – {title}")
plt.xlabel("TF-IDF Score")
plt.tight_layout()

if save_path:
os.makedirs(os.path.dirname(save_path), exist_ok=True)
plt.savefig(save_path, dpi=150)
plt.close()
else:
plt.show()


def generate_wordcloud(tfidf_scores, title, save_path=None):
    """Word cloud based on TF-IDF weights. If save_path is given, saves a PNG instead of showing.
    I am familiar with how word clouds work and their functionality,
    but since this was new to me in Python, I needed a lot of help from AI."""
if not tfidf_scores:
return

wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(tfidf_scores)

plt.figure(figsize=(8, 4))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title(f"Word Cloud – {title}")
plt.tight_layout()

if save_path:
os.makedirs(os.path.dirname(save_path), exist_ok=True)
plt.savefig(save_path, dpi=150)
plt.close()
else:
plt.show()