For this assignment, we will apply the NMF algorithm to a corpus of NYT articles to discover latent topics. The NYT sections (topics) are great, but we don't know how they relate to patterns in article content. Let us see what insights we can mine out of our corpus! We will be starting with our bag of words matrix. Use the 1405 articles located in the data directory.
- Read the `articles.pkl` file using the `read_pickle` function in pandas. Look at the result and understand the structure of your data. Once you are comfortable with the data, store the `'content'` Series you read in, as this is what we will be working with for the rest of the assignment.
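The loading step might look like the sketch below. The real file lives at `data/articles.pkl`; here a tiny stand-in DataFrame is pickled first so the snippet is runnable, and its column names (other than `content`) are assumptions.

```python
import os
import tempfile

import pandas as pd

# Stand-in for data/articles.pkl: a small DataFrame with a 'content' column.
demo = pd.DataFrame({
    'section_name': ['Sports', 'Arts'],
    'headline': ['Game recap', 'Gallery opening'],
    'content': ['The home team won in overtime.', 'A new exhibit opened downtown.'],
})
path = os.path.join(tempfile.mkdtemp(), 'articles.pkl')
demo.to_pickle(path)

# With the real data this would be: df = pd.read_pickle('data/articles.pkl')
df = pd.read_pickle(path)
print(df.shape, list(df.columns))   # inspect the structure before committing to it
content = df['content']             # the Series used for the rest of the assignment
```

With the real corpus you would expect 1405 rows, one per article.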
- Use the `CountVectorizer` from scikit-learn (or `TfidfVectorizer`) to turn the content of the news stories into a document-term matrix. Choose a reasonable value (like 5000) for `max_features` when initializing the vectorizer.
- Use the `get_feature_names` method of your vectorizer (`get_feature_names_out` in newer scikit-learn versions) to store the word represented by each column of your document-term matrix.
With the document matrix (our bags of words), we can begin implementing the NMF algorithm.
- Create an `NMF` class that is initialized with a document matrix (bag of words or tf-idf) V. In addition to the document matrix, it should take parameters k (the number of latent topics) and the maximum number of iterations to perform.
First we need to initialize our weights (W) and features (H) matrices.
- Initialize the weights matrix (W) with positive random values as an n x k matrix, where n is the number of documents and k is the number of latent topics.
- Initialize the feature matrix (H) to be k x m, where m is the number of words in our vocabulary (i.e. the length of each bag). Our original document matrix (V) is an n x m matrix. NOTICE: shape(V) = shape(W * H)
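The initialization might be sketched as below; the fixed seed and default parameter values are choices made here for reproducibility, not part of the assignment.

```python
import numpy as np

class NMF:
    """Skeleton of the NMF class: stores V and initializes W and H."""

    def __init__(self, V, k=7, max_iter=100):
        self.V = np.asarray(V, dtype=float)   # n x m document matrix
        self.k = k
        self.max_iter = max_iter
        n, m = self.V.shape
        rng = np.random.default_rng(0)        # seeded for reproducibility
        self.W = rng.random((n, k))           # weights: documents x topics, positive
        self.H = rng.random((k, m))           # features: topics x words, positive
        # shape(V) == shape(W @ H) == (n, m)
```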
- Next implement your class's `fit()` method. Use a least-squares error metric when updating the matrices W and H; this lets us use the `numpy.linalg.lstsq` solver. To start, update H by calling `lstsq` while holding W fixed, minimizing the sum of squared errors in predicting the document matrix. Since these values should all be at least 0, clip all the values in H after the call to `lstsq`.
- Use the `lstsq` solver to update W while holding H fixed. The `lstsq` solver assumes it is optimizing the right matrix of the multiplication (i.e. x in the equation Ax = b), so you will need to get creative to use it with the dimensions lining up correctly. Brainstorm on paper or a whiteboard how to manipulate the matrices so that `lstsq` gets the dimensionality right and optimizes W (hint: it involves transposes). Clip W appropriately after updating it with `lstsq` to ensure it is at least 0.
- Inside your class's `fit()` method, repeat the two update steps above for a fixed number of iterations, or until convergence (i.e. cost(V, W*H) is close to 0).
- Return the computed weights matrix and features matrix.
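The alternating updates above can be sketched as a standalone function (rather than the full class). `np.linalg.lstsq(A, b)` solves Ax = b for x, so H comes directly from `lstsq(W, V)`; for W, solve the transposed system H.T @ W.T = V.T and transpose the result back.

```python
import numpy as np

def fit(V, k=5, max_iter=50, seed=0):
    """Alternating least-squares NMF sketch: returns (W, H) with V ~ W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))                            # positive random start
    for _ in range(max_iter):
        # Update H holding W fixed: minimize ||W @ H - V||^2.
        H = np.linalg.lstsq(W, V, rcond=None)[0]
        H = H.clip(min=0)                             # enforce non-negativity
        # Update W holding H fixed: lstsq optimizes the right-hand factor,
        # so solve H.T @ W.T = V.T, then transpose back.
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        W = W.clip(min=0)
    return W, H
```

A convergence check on cost(V, W @ H) could replace the fixed iteration count, as the assignment suggests.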
- Write a method that uses W, H, and the document matrix (V) to calculate and return the mean squared error of V - WH.
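Written as a standalone function, the error metric is just the mean of the squared entries of the residual:

```python
import numpy as np

def mse(V, W, H):
    """Mean squared error of the reconstruction V ~ W @ H."""
    return np.mean((V - W @ H) ** 2)
```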
- Using `argsort` on each topic (row) in H, find the index values of the words most associated with that topic. Combine these index values with the word names you stored in the Preliminaries section to print out the most common words for each topic.
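One way to sketch this step: `argsort` sorts ascending, so reverse it and take the first n indices of each row of H. Here `words` is the array saved from the vectorizer; the function name is a choice made for illustration.

```python
import numpy as np

def top_words(H, words, n=10):
    """Return, for each topic (row of H), its n highest-weight words."""
    topics = []
    for row in H:
        top = np.argsort(row)[::-1][:n]     # indices of the largest weights
        topics.append([words[i] for i in top])
    return topics
```

Printing each returned list alongside its topic index gives the per-topic word summary the assignment asks for.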
- Use the scikit-learn `NMF` class to compute the non-negative matrix factorization of our documents. Explore what "topics" are returned.
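A minimal sketch of the scikit-learn call, using a random stand-in for the real document matrix; the rows of `model.components_` play the role of H, and `fit_transform` returns the role of W.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 30))            # stand-in for the real document-term matrix

model = NMF(n_components=5, max_iter=400, init='nndsvda', random_state=0)
W_sk = model.fit_transform(V)       # document-topic weights, analogous to W
H_sk = model.components_            # topic-word features, analogous to H
print(W_sk.shape, H_sk.shape)
```

`init` and `random_state` are optional; fixing them makes the returned topics reproducible across runs, which helps when comparing against your own implementation.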
- Run the code you wrote for the Using Your NMF Function section on the scikit-learn model. How close is the output to what you found using your own NMF implementation?
- Can you add a title to each latent topic representing the words it contains?
- Now that you have labeled the latent features with the topics they represent, explore the strongest latent features for a few articles. Do these make sense given the article? You will have to go back to the raw data you read in to do this.
- How do the NYT sections compare to the topics from the unsupervised learning? What are the differences? Why do you think these differences exist?
- Define a function that displays the headlines/titles of the top 10 documents for each topic.
- Define a function that takes as input a document and displays the top 3 topics it belongs to.
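Both display helpers reduce to an `argsort` on W: column j of W scores every document against topic j, while row i scores document i against every topic. A sketch (function names and the `headlines` input are illustrative choices):

```python
import numpy as np

def top_docs_per_topic(W, headlines, n=10):
    """For each topic j, the headlines of the n documents with the largest W[:, j]."""
    return {j: [headlines[i] for i in np.argsort(W[:, j])[::-1][:n]]
            for j in range(W.shape[1])}

def top_topics_for_doc(W, doc_idx, n=3):
    """The indices of the n topics with the largest weights in row doc_idx of W."""
    return list(np.argsort(W[doc_idx])[::-1][:n])
```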
- Define a function that ensures consistent ordering of topics between your NMF class and the sklearn NMF class.