OIM3640 · asarych11 · Apr 8, 2025 · Apr 8, 2025
diff --git a/README.md b/README.md
@@ -1,3 +1,96 @@
 # Text-Analysis-Project
 
 Please read the [instructions](instructions.md).
+
+## 1. Project Overview
+
+    I decided to use this project to analyze the languages of two ideologically radically opposite news sources: 
+HuffPost (left-leaning) and The Epoch Times (right-leaning). I did so by extracting 40-50 latest news articles
+from each source using the 'newspaper' Python library and processing them using a variety of programmatic tools
+from simple code to 'collections', '.json', 'nltk' libraries. The research question is: **How does the language used
+by political media sources and the sentiment conveyed by them differ based on their ideology?** I hoped to not
+only extract and clean the data, but also uncover meaningful patterns in their political narratives through
+comparing their word frequencies, vocabularies and sentiments. 
+
+## 2. Implementation
+
+    I can separate my project into three main steps: **data extraction/acquisition**, **data processing** (cleaning,
+sorting), and analysis. First I looked at multiple political news sources to find the most fitting ones, not only
+in terms of content and ideology, but also in terms of use of accessibility of their API and how easily they could 
+be used and implemented into my code. ChatGPT was a big help here as it quickly identified matching news sources in
+both criterias, and helped deal with (403 errors). I tested different sources, and rotated user-agent headers before
+settling on HuffPost and The Epoch Times. I scraped 50 articles from both website using the 'newspaper' library and
+stored tham as '.json' files containing a list of dictionaries containing their titles, URLs, and body texts.
+    In the processing phase, I used various NLP techniques to clean the data nd organize it. I used 'string' library
+and 'isalpha()' to remove punctuation, non-alphabetic characters, and lowercased everything. I was planning to use
+'nltk' for analysis, but also used it to remove stop words, although had to mannually add some because my cleaning
+technique wasn't as perfect. I then build histograms of word frequencies for each article, which helped me later
+with analysis, but also helped me single out the **top words** from each article: instead of arbitrarily selecting
+the most common one, I gathered all words that had at least (max_frequency - 10) appearances. This way I created
+the richer vocabularies of most representative political terms for both sources. I then played around with the code
+and ended up creating multiple '.json' files - with unique words, unique top words (3320 HuffSpot vs 3420 The Epoch
+Times), top words(5617 HuffSpot vs 6104 The Epoch Times). 
+    I then used these findings to conduct **aggregate frequency analyses** and **sentiment analysis**. I merged all
+unique top words per source, counted their frequencies, and compared shared and exclusive vocabularies. I applied
+VADER from 'nltk.sentiment' analysis package to access overall emotional tone of each article, and average those
+scores to compare outlets. I used GenAI tools to debug unfamiliar errors, understand library documentation, 
+brainstorm NLP strategies, compare newssources and bypass website restrictions. For a while I considered a different
+approach to the entire project where I would extract the political vocabulary and sort articles based on that, 
+which involved using a different documentation ('mediaWiki'), which OpenAI was also helpful with. 
+
+## Results
+
+    The frequency and vocabulary analyses showed clear patterns. As mentioned above, HuffPost produced a total of 5617
+top words with 3320 of them being unique. The Epoch Times had 6104 words with unique being 3423. Out of those, only 
+1193 were shared by both sources, leabing 2207 unique to HuffPost and 2230 to The Epoch Times. This suggests a strong
+difference in languages used by the two sources, confirming that each of them uses a distinctive terminology to 
+construct their political narratives. As for top words, they also reflect different vocabularies. Here's the ouput 
+example:
+
+    HuffPost Top 20 Words:                                                           Epoch Times Top 20 Words:
+    back: 24                                                                         china: 16
+    first: 23                                                                        united: 15
+    huffpost: 22                                                                     states: 15
+    help: 22                                                                         trump: 14
+    free: 21                                                                         april: 14
+    moment: 21                                                                       tariffs: 13
+    without: 21                                                                      need: 13
+    experience: 21                                                                   president: 12
+    support: 20                                                                      house: 12
+    fair: 20                                                                         make: 12
+    news: 19                                                                         chinese: 11
+    supported: 19                                                                    first: 11
+    honest: 19                                                                       something: 11
+    wont: 19                                                                         times: 11
+    mission: 19                                                                      last: 11
+    providing: 19                                                                    percent: 10
+    critical: 19                                                                     effect: 10
+    offering: 19                                                                     people: 10
+    qualifying: 19                                                                   going: 10
+    contributors: 19                                                                 years: 10
+
+HuffPost's words include "support", "free", "fair", "contributors", "experience", suggesting it appeals to values like 
+fairness and reader engagement. The Epoch Times includes "china", "tariffs", "president", "communist", and "freedom" - 
+more menacing, serious words indicating a focus on national politics and foreign relations. Of course, it is tough to 
+judge when analyzing just two little-known artiles. 
+    Sentiment analysis provided a very strong contrast too. THe average sentiment score for HuffPost was 0.7757,
+suggesting a positive tone, while a score of 0.2723 for The Epoch Times suggest a more neutral tone. This could reflect
+outlets' differing rhetorical priorities. HuffPost's language aims to uplift it's audience, especially around topics 
+like social justice, rights, and activism. While TET engages in ideological critique, geopolitical conflicts, etc., 
+skewing towards neutral and cautious language. HuffPost's stories are more personal and value-driven, while TET frames
+stories around systemic challenges and national concerns, which lend themselves to more emotionally restrained language.
+
+## 4. Reflection
+
+    From a process point of view, I believe once I was able to extract the files, it was smooth sailing from there. The
+main roadblocks for me personally were in file extraction as I wasn't sure how to deal with 403 Errors and wasn't as
+comfortable with APIs. Also, after extracting everything and organizing and cleaning data, I got a little lost as in what
+I should do next, but was able to figure it out. I believe the results yilded through the project are not the best reflection
+of differing languages, and there is always room for improvement. I need to learn more NLP techniques to allow for better 
+ideation of what I can do with the data, and have a clear plan ahead of how I should process it. For example, after separating
+top words into lists, I wasn't sure how to use those lists for sentiment or similarity analyses - it's just a wrong type of
+data for that, so I need to work on my look ahead. Another big takeaway is how powerful NLP tools like frequency counts and 
+VADER sentiment can be when used systematically. I was also initially intimidated by the modular design of the project, e.g. 
+how I would organize everything, but learned it's value and realized how important it is to keep codes and project parts separate
+as it makes it much easier. If I could start over, I would build a clearer testing pipeline earlier to work with intermidiate
+outputs. Going forward, I will apply what I learned in other projects and experiments involving language processing. 
diff --git a/project/analysis/aggregate_frequency.py b/project/analysis/aggregate_frequency.py
@@ -0,0 +1,30 @@
+import json
+from collections import Counter
+
+# Load word lists
+with open("project/data/huffpost__cleaned4.json", encoding="utf-8") as file:
+    huffpost_words = json.load(file)
+    huffpost_unique = set(huffpost_words) # turn them into sets for later
+
+with open("project/data/theepochtimes_cleaned4.json", encoding="utf-8") as file:
+    epochtimes_words = json.load(file)
+    epochtimes_unique = set(epochtimes_words)
+
+huffpost_counts = Counter(huffpost_words) #count frequency
+epochtimes_counts = Counter(epochtimes_words)
+
+print("\nHuffPost Top 20 Words:") # top 20 most common words from each source
+for word, count in huffpost_counts.most_common(20):
+    print(f"{word}: {count}")
+
+print("\nEpoch Times Top 20 Words:")
+for word, count in epochtimes_counts.most_common(20):
+    print(f"{word}: {count}")
+
+shared_words = huffpost_unique & epochtimes_unique
+only_huffpost = huffpost_unique - epochtimes_unique
+only_epochtimes = epochtimes_unique - huffpost_unique
+
+print(f"\nShared words: {len(shared_words)}")
+print(f"Words only in HuffPost: {len(only_huffpost)}")
+print(f"Words only in Epoch Times: {len(only_epochtimes)}\n")
diff --git a/project/analysis/epohctimes_analysis.py b/project/analysis/epohctimes_analysis.py
@@ -0,0 +1,105 @@
+import json
+import string
+from nltk.corpus import stopwords
+import nltk
+nltk.download('stopwords')
+
+def preprocess_text(text):
+    """
+    cleans and tokenizes the text. makes it lowercase, removes the punctuation and whitespace.
+    """
+    words = text.split()
+    cleaned_words = []
+    for word in words:
+        cleaned_word = word.strip(string.punctuation + string.whitespace).lower()
+
+        extra_cleaned_word = '' # cleans the word further. used this in one of the text analysis excercises.
+
+        for char in cleaned_word:
+            if char.isalpha():
+                extra_cleaned_word += char
+
+        extra_cleaned_word = extra_cleaned_word.lower()
+
+        if extra_cleaned_word:
+            cleaned_words.append(extra_cleaned_word)
+    return cleaned_words
+
+def extract_histograms_from_file(filepath):
+    """
+    loads JSON file and builds a sorted histogram for each article.
+    returns a list of dictionaries sorted by descending frequency.
+    """
+    with open(filepath, "r", encoding="utf-8") as file:
+        articles = json.load(file)
+
+    histograms = []
+
+    for article in articles[8:]: # the first 8 articles downloaded as descriptions of news categories so I'm removing them
+        text = article.get("text", "")
+        words = preprocess_text(text)
+        histogram = {}
+
+        for word in words:
+            histogram[word] = histogram.get(word, 0) + 1
+
+        sorted_histogram = dict(sorted(histogram.items(), key=lambda item: item[1], reverse=True)) # sort by value descending, and rebuild the dictionary in sorted order
+        histograms.append(sorted_histogram)
+
+    return histograms
+
+def remove_stopwords_from_histogram(histogram, stop_words):
+    """
+    removes stop words from all histograms. the stop word list in __name__ was 
+    created mannually by looking through outputs of the function and finding stop
+    words and adding them to the list. 
+    """
+
+    filtered_histogram = {} # create an empty dictionary to store the filtered words
+
+    for word, count in histogram.items(): # loop through each word and its count in the original histogram
+        if word not in stop_words: # check if the word is NOT in the stop word list
+            filtered_histogram[word] = count # add it to the new dictionary
+
+    # Return the cleaned dictionary
+    return filtered_histogram
+
+def extract_top_words_from_histograms(histograms):
+    """
+    for each article's histogram, finds the highest word count,
+    then collects all words with at least (max_count - 10) frequency.
+    returns one combined list of selected words.
+    """
+    selected_words = []
+
+    for histogram in histograms:
+
+        max_freq = max(histogram.values())
+        threshold = max_freq - 10
+
+        for word, count in histogram.items():
+            if count >= threshold:
+                selected_words.append(word)
+
+    return selected_words
+
+if __name__ == "__main__":
+    stop_words = set(stopwords.words("english"))
+    stop_words.update(["advertisement", "said", "mr", "us", "—", "“i", "didnt", "youll", "loadingerror", "use", "well", "time", 
+                       "also", "one", "like", "would", "many", "new", "including", "adfree", "way", "could", "youve", "cant", "adfree"])
+
+    histograms = extract_histograms_from_file("project/data/theepochtimes_articles.json") # load histograms from file
+
+    clean_histograms = [] # apply stopword removal to each histogram and store in a list
+
+    for h in histograms:
+        cleaned = remove_stopwords_from_histogram(h, stop_words)
+        clean_histograms.append(cleaned)
+
+    top_words = extract_top_words_from_histograms(clean_histograms)
+
+    with open("project/data/theepochtimes_cleaned4.json", "w", encoding = "utf-8") as file:
+        json.dump(top_words, file, indent = 2, ensure_ascii = False)
+
+    # print(json.dumps(top_words, indent=2))
+    # print(len(top_words))
diff --git a/project/analysis/huffpost_analysis.py b/project/analysis/huffpost_analysis.py
@@ -0,0 +1,108 @@
+import json
+import string
+from nltk.corpus import stopwords
+import nltk
+nltk.download('stopwords')
+
+def preprocess_text(text):
+    """
+    cleans and tokenizes the text. makes it lowercase, removes the punctuation and whitespace.
+    """
+    words = text.split()
+    cleaned_words = []
+    for word in words:
+        cleaned_word = word.strip(string.punctuation + string.whitespace).lower()
+
+        extra_cleaned_word = '' # cleans the word further. used this in one of the text analysis excercises.
+
+        for char in cleaned_word:
+            if char.isalpha():
+                extra_cleaned_word += char
+
+        extra_cleaned_word = extra_cleaned_word.lower()
+
+        if extra_cleaned_word:
+            cleaned_words.append(extra_cleaned_word)
+    return cleaned_words
+
+def extract_histograms_from_file(filepath):
+    """
+    loads JSON file and builds a sorted histogram for each article.
+    returns a list of dictionaries sorted by descending frequency.
+    """
+    with open(filepath, "r", encoding="utf-8") as file:
+        articles = json.load(file)
+
+    histograms = []
+
+    for article in articles[8:]: # the first 8 articles downloaded as descriptions of news categories so I'm removing them
+        text = article.get("text", "")
+        words = preprocess_text(text)
+        histogram = {}
+
+        for word in words:
+            histogram[word] = histogram.get(word, 0) + 1
+
+        sorted_histogram = dict(sorted(histogram.items(), key=lambda item: item[1], reverse=True)) # sort by value descending, and rebuild the dictionary in sorted order
+        histograms.append(sorted_histogram)
+
+    return histograms
+
+def remove_stopwords_from_histogram(histogram, stop_words):
+    """
+    removes stop words from all histograms. the stop word list in __name__ was 
+    created mannually by looking through outputs of the function and finding stop
+    words and adding them to the list. 
+    """
+
+    filtered_histogram = {} # create an empty dictionary to store the filtered words
+
+    for word, count in histogram.items(): # loop through each word and its count in the original histogram
+        if word not in stop_words: # check if the word is NOT in the stop word list
+            filtered_histogram[word] = count # add it to the new dictionary
+
+    # Return the cleaned dictionary
+    return filtered_histogram
+
+def extract_top_words_from_histograms(histograms):
+    """
+    for each article's histogram, finds the highest word count,
+    then collects all words with at least (max_count - 10) frequency.
+    returns one combined list of selected words.
+    """
+    selected_words = []
+
+    for histogram in histograms: # had chatgpt help me here as I was getting ValueError: max() arg is an empty sequence
+        # due to at least one of the histograms in clean_histograms being empty
+        if histogram == {}:
+            continue
+
+        max_freq = max(histogram.values())
+        threshold = max_freq - 10
+
+        for word, count in histogram.items():
+            if count >= threshold:
+                selected_words.append(word)
+
+    return selected_words
+
+if __name__ == "__main__":
+    stop_words = set(stopwords.words("english"))
+    stop_words.update(["advertisement", "said", "mr", "us", "—", "“i", "didnt", "youll", "loadingerror", "use", "well", "time", 
+                       "also", "one", "like", "would", "many", "new", "including", "adfree", "way", "could", "youve", "cant", "adfree"])
+
+    histograms = extract_histograms_from_file("project/data/huffpost_articles.json") # load histograms from file
+
+    clean_histograms = [] # apply stopword removal to each histogram and store in a list
+
+    for h in histograms:
+        cleaned = remove_stopwords_from_histogram(h, stop_words)
+        clean_histograms.append(cleaned)
+
+    top_words = extract_top_words_from_histograms(clean_histograms)
+
+    with open("project/data/huffpost__cleaned4.json", "w", encoding = "utf-8") as file:
+        json.dump(top_words, file, indent = 2, ensure_ascii = False)
+
+    # print(json.dumps(top_words, indent=2))
+    # print(len(top_words))
diff --git a/project/analysis/sentiment_analysis.py b/project/analysis/sentiment_analysis.py
@@ -0,0 +1,37 @@
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+import json
+
+analyzer = SentimentIntensityAnalyzer()
+
+def sentiment_huffpost():
+    with open("project/data/huffpost_articles.json", encoding="utf-8") as file: 
+        articles = json.load(file)
+
+        scores = []
+
+        for article in articles[8:]:  # skip category descriptions
+            text = article.get("text", "")
+            if text.strip():  # ignore empty
+                score = analyzer.polarity_scores(text)
+                scores.append(score["compound"])
+
+        avg_score = sum(scores) / len(scores) # average sentiment score
+        print("Average HuffPost sentiment:", round(avg_score, 4))
+
+def sentiment_epochtimes():
+    with open("project/data/theepochtimes_articles.json", encoding="utf-8") as file: 
+        articles = json.load(file)
+
+        scores = []
+
+        for article in articles:
+            text = article.get("text", "")
+            if text.strip():  # ignore empty
+                score = analyzer.polarity_scores(text)
+                scores.append(score["compound"])
+
+        avg_score = sum(scores) / len(scores) # average sentiment score
+        print("Average The Epoch Times sentiment:", round(avg_score, 4))
+
+sentiment_huffpost()
+sentiment_epochtimes()