diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/octave-core b/Anomaly Detection and Recommender Systems/mlclass-ex8/octave-core
deleted file mode 100644
index 0ea3a84..0000000
Binary files a/Anomaly Detection and Recommender Systems/mlclass-ex8/octave-core and /dev/null differ
diff --git a/Anomaly Detection and Recommender Systems/ex8.pdf b/AnomalyDetectionandRecommenderSystems/ex8.pdf
similarity index 100%
rename from Anomaly Detection and Recommender Systems/ex8.pdf
rename to AnomalyDetectionandRecommenderSystems/ex8.pdf
diff --git a/AnomalyDetectionandRecommenderSystems/ex8.txt b/AnomalyDetectionandRecommenderSystems/ex8.txt
new file mode 100644
index 0000000..5299b7c
--- /dev/null
+++ b/AnomalyDetectionandRecommenderSystems/ex8.txt
@@ -0,0 +1,773 @@
Programming Exercise 8: Anomaly Detection and Recommender Systems
Machine Learning
December 6, 2011

Introduction

In this exercise, you will implement the anomaly detection algorithm and apply it to detect failing servers on a network. In the second part, you will use collaborative filtering to build a recommender system for movies. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics.

To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, use the cd command in Octave to change to this directory before starting this exercise.

Files included in this exercise

ex8.m - Octave/Matlab script for the first part of the exercise
ex8_cofi.m - Octave/Matlab script for the second part of the exercise
ex8data1.mat - First example dataset for anomaly detection
ex8data2.mat - Second example dataset for anomaly detection
ex8_movies.mat - Movie review dataset
ex8_movieParams.mat - Parameters provided for debugging
multivariateGaussian.m - Computes the probability density function for a Gaussian distribution
visualizeFit.m - 2D plot of a Gaussian distribution and a dataset
checkCostFunction.m - Gradient checking for collaborative filtering
computeNumericalGradient.m - Numerically compute gradients
fmincg.m - Function minimization routine (similar to fminunc)
loadMovieList.m - Loads the list of movies into a cell-array
movie_ids.txt - List of movies
normalizeRatings.m - Mean normalization for collaborative filtering
[ ] estimateGaussian.m - Estimate the parameters of a Gaussian distribution with a diagonal covariance matrix
[ ] selectThreshold.m - Find a threshold for anomaly detection
[ ] cofiCostFunc.m - Implement the cost function for collaborative filtering

[ ] indicates files you will need to complete.

Throughout the first part of the exercise (anomaly detection) you will be using the script ex8.m. For the second part, on collaborative filtering, you will use ex8_cofi.m. These scripts set up the dataset for the problems and make calls to the functions that you will write; you are only required to modify functions in the other files, by following the instructions in this assignment.

Where to get help

We strongly encourage using the online Q&A Forum to discuss exercises with other students. However, do not look at any source code written by others or share your source code with others.

If you run into network errors using the submit script, you can also use an online form for submitting your solutions.
To use this alternative submission interface, run the submitWeb script to generate a submission file (e.g., submit_ex8_part2.txt). You can then submit this file through the web submission form in the programming exercises page (go to the programming exercises page, then select the exercise you are submitting for). If you have no problems submitting through the standard submission system using the submit script, you do not need to use this alternative submission interface.

1 Anomaly detection

In this exercise, you will implement an anomaly detection algorithm to detect anomalous behavior in server computers. The features measure the throughput (mb/s) and latency (ms) of response of each server. While your servers were operating, you collected m = 307 examples of how they were behaving, and thus have an unlabeled dataset {x^{(1)}, ..., x^{(m)}}. You suspect that the vast majority of these examples are "normal" (non-anomalous) examples of the servers operating normally, but there might also be some examples of servers acting anomalously within this dataset.

You will use a Gaussian model to detect anomalous examples in your dataset. You will first start on a 2D dataset that will allow you to visualize what the algorithm is doing. On that dataset you will fit a Gaussian distribution and then find values that have very low probability and hence can be considered anomalies. After that, you will apply the anomaly detection algorithm to a larger dataset with many dimensions. You will be using ex8.m for this part of the exercise.

The first part of ex8.m will visualize the dataset as shown in Figure 1.

[Figure 1: The first dataset; a scatter plot of latency (ms) against throughput (mb/s).]

1.1 Gaussian distribution

To perform anomaly detection, you will first need to fit a model to the data's distribution.

Given a training set {x^{(1)}, ..., x^{(m)}} (where x^{(i)} \in R^n), you want to estimate the Gaussian distribution for each of the features x_i. For each feature i = 1...n, you need to find parameters \mu_i and \sigma_i^2 that fit the data in the i-th dimension {x_i^{(1)}, ..., x_i^{(m)}} (the i-th dimension of each example).

The Gaussian distribution is given by

    p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}},

where \mu is the mean and \sigma^2 controls the variance.

1.2 Estimating parameters for a Gaussian

You can estimate the parameters (\mu_i, \sigma_i^2) of the i-th feature by using the following equations. To estimate the mean, you will use

    \mu_i = \frac{1}{m} \sum_{j=1}^{m} x_i^{(j)},    (1)

and for the variance you will use

    \sigma_i^2 = \frac{1}{m} \sum_{j=1}^{m} (x_i^{(j)} - \mu_i)^2.    (2)

Your task is to complete the code in estimateGaussian.m. This function takes as input the data matrix X and should output an n-dimensional vector mu that holds the means of all the n features and another n-dimensional vector sigma2 that holds the variances of all the features. You can implement this using a for-loop over every feature and every training example (though a vectorized implementation might be more efficient; feel free to use a vectorized implementation if you prefer). Note that in Octave, the var function will (by default) use \frac{1}{m-1} instead of \frac{1}{m} when computing \sigma_i^2.
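As a rough sketch of what a vectorized version could look like (one possible implementation, not the only accepted one; it assumes the starter file's [mu sigma2] = estimateGaussian(X) signature):

    function [mu sigma2] = estimateGaussian(X)
      % X is an m x n data matrix; return the per-feature mean and the
      % 1/m (not var()'s default 1/(m-1)) variance as n-dimensional vectors.
      [m, n] = size(X);
      mu = (1 / m) * sum(X, 1)';
      sigma2 = (1 / m) * sum((X - repmat(mu', m, 1)) .^ 2, 1)';
    end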
Once you have completed the code in estimateGaussian.m, the next part of ex8.m will visualize the contours of the fitted Gaussian distribution. You should get a plot similar to Figure 2. From your plot, you can see that most of the examples are in the region with the highest probability, while the anomalous examples are in the regions with lower probabilities.

[Figure 2: The Gaussian distribution contours of the distribution fit to the dataset.]

You should now submit your estimate Gaussian parameters function.

1.3 Selecting the threshold, ε

Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability given this distribution and which examples have a very low probability. The low probability examples are more likely to be the anomalies in our dataset. One way to determine which examples are anomalies is to select a threshold based on a cross validation set. In this part of the exercise, you will implement an algorithm to select the threshold ε using the F1 score on a cross validation set.

You should now complete the code in selectThreshold.m. For this, we will use a cross validation set {(x_{cv}^{(1)}, y_{cv}^{(1)}), ..., (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})}, where the label y = 1 corresponds to an anomalous example, and y = 0 corresponds to a normal example. For each cross validation example, we will compute p(x_{cv}^{(i)}). The vector of all of these probabilities p(x_{cv}^{(1)}), ..., p(x_{cv}^{(m_{cv})}) is passed to selectThreshold.m in the vector pval. The corresponding labels y_{cv}^{(1)}, ..., y_{cv}^{(m_{cv})} are passed to the same function in the vector yval.

The function selectThreshold.m should return two values; the first is the selected threshold ε. If an example x has a low probability p(x) < ε, then it is considered to be an anomaly. The function should also return the F1 score, which tells you how well you're doing on finding the ground truth anomalies given a certain threshold. For many different values of ε, you will compute the resulting F1 score by computing how many examples the current threshold classifies correctly and incorrectly.

The F1 score is computed using precision (prec) and recall (rec):

    F_1 = \frac{2 \cdot prec \cdot rec}{prec + rec},    (3)

where you compute precision and recall by:

    prec = \frac{tp}{tp + fp},    (4)

    rec = \frac{tp}{tp + fn},    (5)

where

- tp is the number of true positives: the ground truth label says it's an anomaly and our algorithm correctly classified it as an anomaly.
- fp is the number of false positives: the ground truth label says it's not an anomaly, but our algorithm incorrectly classified it as an anomaly.
- fn is the number of false negatives: the ground truth label says it's an anomaly, but our algorithm incorrectly classified it as not being anomalous.

In the provided code selectThreshold.m, there is already a loop that will try many different values of ε and select the best ε based on the F1 score. You should now complete the code in selectThreshold.m. You can implement the computation of the F1 score using a for-loop over all the cross validation examples (to compute the values tp, fp, fn). You should see a value for epsilon of about 8.99e-05.

Implementation Note: In order to compute tp, fp and fn, you may be able to use a vectorized implementation rather than loop over all the examples. This can be implemented by Octave's equality test between a vector and a single number. If you have several binary values in an n-dimensional binary vector v \in {0, 1}^n, you can find out how many values in this vector are 0 by using: sum(v == 0). You can also apply a logical and operator to such binary vectors. For instance, let cvPredictions be a binary vector with as many elements as your cross validation set, where the i-th element is 1 if your algorithm considers x_{cv}^{(i)} an anomaly, and 0 otherwise. You can then, for example, compute the number of false positives using: fp = sum((cvPredictions == 1) & (yval == 0)).
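Putting those pieces together, the body of the provided loop over candidate thresholds could look roughly like this (a sketch only; cvPredictions is the intermediate vector described above):

    cvPredictions = (pval < epsilon);              % 1 = flagged as anomaly
    tp = sum((cvPredictions == 1) & (yval == 1));  % true positives
    fp = sum((cvPredictions == 1) & (yval == 0));  % false positives
    fn = sum((cvPredictions == 0) & (yval == 1));  % false negatives
    prec = tp / (tp + fp);
    rec = tp / (tp + fn);
    F1 = (2 * prec * rec) / (prec + rec);

Note that prec or rec can come out NaN for extreme thresholds (division by zero); since NaN never compares greater than the running best F1, such candidates are simply never selected.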
Once you have completed the code in selectThreshold.m, the next step in ex8.m will run your anomaly detection code and circle the anomalies in the plot (Figure 3).

[Figure 3: The classified anomalies, circled on the latency/throughput scatter plot.]

You should now submit your select threshold function.

1.4 High dimensional dataset

The last part of the script ex8.m will run the anomaly detection algorithm you implemented on a more realistic and much harder dataset. In this dataset, each example is described by 11 features, capturing many more properties of your compute servers.

The script will use your code to estimate the Gaussian parameters (\mu_i and \sigma_i^2), evaluate the probabilities for both the training data X from which you estimated the Gaussian parameters and for the cross-validation set Xval. Finally, it will use selectThreshold to find the best threshold ε. You should see a value of epsilon of about 1.38e-18, and 117 anomalies found.

2 Recommender Systems

In this part of the exercise, you will implement the collaborative filtering learning algorithm and apply it to a dataset of movie ratings (the MovieLens 100k Dataset from GroupLens Research). This dataset consists of ratings on a scale of 1 to 5. The dataset has n_u = 943 users and n_m = 1682 movies. For this part of the exercise, you will be working with the script ex8_cofi.m.

In the next parts of this exercise, you will implement the function cofiCostFunc.m that computes the collaborative filtering objective function and gradient. After implementing the cost function and gradient, you will use fmincg.m to learn the parameters for collaborative filtering.

2.1 Movie ratings dataset

The first part of the script ex8_cofi.m will load the dataset ex8_movies.mat, providing the variables Y and R in your Octave environment.

The matrix Y (a num_movies x num_users matrix) stores the ratings y^{(i,j)} (from 1 to 5). The matrix R is a binary-valued indicator matrix, where R(i, j) = 1 if user j gave a rating to movie i, and R(i, j) = 0 otherwise. The objective of collaborative filtering is to predict movie ratings for the movies that users have not yet rated, that is, the entries with R(i, j) = 0. This will allow us to recommend the movies with the highest predicted ratings to the user.

To help you understand the matrix Y, the script ex8_cofi.m will compute the average movie rating for the first movie (Toy Story) and output the average rating to the screen.
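For instance, the average rating over only the users who actually rated the first movie could be computed along these lines (a sketch, not necessarily the script's exact code):

    % Average of Y(1, :) restricted to entries with R(1, j) = 1
    avgRating = mean(Y(1, R(1, :) == 1));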
Throughout this part of the exercise, you will also be working with the matrices X and Theta:

    X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix}, \quad
    Theta = \begin{bmatrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}.

The i-th row of X corresponds to the feature vector x^{(i)} for the i-th movie, and the j-th row of Theta corresponds to one parameter vector \theta^{(j)} for the j-th user. Both x^{(i)} and \theta^{(j)} are n-dimensional vectors. For the purposes of this exercise, you will use n = 100, and therefore, x^{(i)} \in R^{100} and \theta^{(j)} \in R^{100}. Correspondingly, X is an n_m x 100 matrix and Theta is an n_u x 100 matrix.

2.2 Collaborative filtering learning algorithm

Now, you will start implementing the collaborative filtering learning algorithm. You will start by implementing the cost function (without regularization).

The collaborative filtering algorithm in the setting of movie recommendations considers a set of n-dimensional parameter vectors x^{(1)}, ..., x^{(n_m)} and \theta^{(1)}, ..., \theta^{(n_u)}, where the model predicts the rating for movie i by user j as y^{(i,j)} = (\theta^{(j)})^T x^{(i)}. Given a dataset that consists of a set of ratings produced by some users on some movies, you wish to learn the parameter vectors x^{(1)}, ..., x^{(n_m)}, \theta^{(1)}, ..., \theta^{(n_u)} that produce the best fit (minimize the squared error).

You will complete the code in cofiCostFunc.m to compute the cost function and gradient for collaborative filtering. Note that the parameters to the function (i.e., the values that you are trying to learn) are X and Theta. In order to use an off-the-shelf minimizer such as fmincg, the cost function has been set up to unroll the parameters into a single vector params. You had previously used the same vector unrolling method in the neural networks programming exercise.

2.2.1 Collaborative filtering cost function

The collaborative filtering cost function (without regularization) is given by

    J(x^{(1)}, ..., x^{(n_m)}, \theta^{(1)}, ..., \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2.

You should now modify cofiCostFunc.m to return this cost in the variable J. Note that you should be accumulating the cost for user j and movie i only if R(i, j) = 1.

After you have completed the function, the script ex8_cofi.m will run your cost function. You should expect to see an output of 22.22.

You should now submit your cost function.

Implementation Note: We strongly encourage you to use a vectorized implementation to compute J, since it will later be called many times by the optimization package fmincg. As usual, it might be easiest to first write a non-vectorized implementation (to make sure you have the right answer), and then modify it to become a vectorized implementation (checking that the vectorization steps don't change your algorithm's output). To come up with a vectorized implementation, the following tip might be helpful: you can use the R matrix to set selected entries to 0. For example, R .* M will do an element-wise multiplication between M and R; since R only has elements with values either 0 or 1, this has the effect of setting the elements of M to 0 only when the corresponding value in R is 0. Hence, sum(sum(R.*M)) is the sum of all the elements of M for which the corresponding element in R equals 1.
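Following that tip, the vectorized cost can be quite short (a sketch, assuming params has already been reshaped back into X and Theta, as the starter code does):

    % Prediction error on every entry, masked so only rated entries contribute
    errors = (X * Theta' - Y) .* R;       % num_movies x num_users
    J = (1 / 2) * sum(sum(errors .^ 2));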
2.2.2 Collaborative filtering gradient

Now, you should implement the gradient (without regularization). Specifically, you should complete the code in cofiCostFunc.m to return the variables X_grad and Theta_grad. Note that X_grad should be a matrix of the same size as X and, similarly, Theta_grad should be a matrix of the same size as Theta. The gradients of the cost function are given by:

    \frac{\partial J}{\partial x_k^{(i)}} = \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)}

    \frac{\partial J}{\partial \theta_k^{(j)}} = \sum_{i:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)}.

Note that the function returns the gradient for both sets of variables by unrolling them into a single vector. After you have completed the code to compute the gradients, the script ex8_cofi.m will run a gradient check (checkCostFunction) to numerically check the implementation of your gradients (this is similar to the numerical check that you used in the neural networks exercise). If your implementation is correct, you should find that the analytical and numerical gradients match up closely.

You should now submit your collaborative filtering gradient function.

Implementation Note: You can get full credit for this assignment without using a vectorized implementation, but your code will run much more slowly (a small number of hours), and so we recommend that you try to vectorize your implementation.

To get started, you can implement the gradient with a for-loop over movies (for computing \partial J / \partial x_k^{(i)}) and a for-loop over users (for computing \partial J / \partial \theta_k^{(j)}). When you first implement the gradient, you might start with an unvectorized version, by implementing another inner for-loop that computes each element in the summation. After you have completed the gradient computation this way, you should try to vectorize your implementation (vectorize the inner for-loops), so that you're left with only two for-loops (one for looping over movies to compute \partial J / \partial x_k^{(i)} for each movie, and one for looping over users to compute \partial J / \partial \theta_k^{(j)} for each user). A sketch of the result appears after this tip.

Implementation Tip: To perform the vectorization, you might find this helpful: you should come up with a way to compute all the derivatives associated with x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)} (i.e., the derivative terms associated with the feature vector x^{(i)}) at the same time. Let us define the derivatives for the feature vector of the i-th movie as:

    (X_grad(i, :))^T = \begin{bmatrix} \partial J / \partial x_1^{(i)} \\ \partial J / \partial x_2^{(i)} \\ \vdots \\ \partial J / \partial x_n^{(i)} \end{bmatrix} = \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta^{(j)}.

To vectorize the above expression, you can start by indexing into Theta and Y to select only the elements of interest (that is, those with r(i, j) = 1). Intuitively, when you consider the features for the i-th movie, you only need to be concerned about the users who had given ratings to the movie, and this allows you to remove all the other users from Theta and Y.

Concretely, you can set idx = find(R(i, :) == 1) to be a list of all the users that have rated movie i. This will allow you to create the temporary matrices Theta_temp = Theta(idx, :) and Y_temp = Y(i, idx) that index into Theta and Y to give you only the set of users which have rated the i-th movie. This will allow you to write the derivatives as:

    X_grad(i, :) = (X(i, :) * Theta_temp' - Y_temp) * Theta_temp.

(Note: the vectorized computation above returns a row-vector instead.) After you have vectorized the computations of the derivatives with respect to x^{(i)}, you should use a similar method to vectorize the derivatives with respect to \theta^{(j)} as well.
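As a sketch, the resulting two-loop gradient computation might look like this (Theta_temp and Y_temp are the temporaries from the tip above; X_temp is a hypothetical name introduced here by analogy, not one from the handout):

    for i = 1:num_movies
      idx = find(R(i, :) == 1);        % users who rated movie i
      Theta_temp = Theta(idx, :);
      Y_temp = Y(i, idx);
      X_grad(i, :) = (X(i, :) * Theta_temp' - Y_temp) * Theta_temp;
    end

    for j = 1:num_users
      idx = find(R(:, j) == 1);        % movies rated by user j
      X_temp = X(idx, :);
      Y_temp = Y(idx, j);
      Theta_grad(j, :) = (X_temp * Theta(j, :)' - Y_temp)' * X_temp;
    end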
2.2.3 Regularized cost function

The cost function for collaborative filtering with regularization is given by

    J(x^{(1)}, ..., x^{(n_m)}, \theta^{(1)}, ..., \theta^{(n_u)}) = \frac{1}{2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2.

You should now add regularization to your original computations of the cost function, J. After you are done, the script ex8_cofi.m will run your regularized cost function, and you should expect to see a cost of about 31.34.

You should now submit your regularized cost function.

2.2.4 Regularized gradient

Now that you have implemented the regularized cost function, you should proceed to implement regularization for the gradient. You should add to your implementation in cofiCostFunc.m to return the regularized gradient by adding the contributions from the regularization terms. Note that the gradients for the regularized cost function are given by:

    \frac{\partial J}{\partial x_k^{(i)}} = \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)} + \lambda x_k^{(i)}

    \frac{\partial J}{\partial \theta_k^{(j)}} = \sum_{i:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)} + \lambda \theta_k^{(j)}.

This means that you just need to add \lambda x^{(i)} to the X_grad(i, :) variable described earlier, and add \lambda \theta^{(j)} to the Theta_grad(j, :) variable described earlier.
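Since the regularization terms are simple element-wise additions, one way to bolt them onto the unregularized quantities is (a sketch, assuming J, X_grad and Theta_grad already hold the unregularized values):

    J = J + (lambda / 2) * (sum(sum(Theta .^ 2)) + sum(sum(X .^ 2)));
    X_grad = X_grad + lambda * X;              % adds lambda * x^(i) to row i
    Theta_grad = Theta_grad + lambda * Theta;  % adds lambda * theta^(j) to row j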
After you have completed the code to compute the gradients, the script ex8_cofi.m will run another gradient check (checkCostFunction) to numerically check the implementation of your gradients.

You should now submit the regularized gradient function.

2.3 Learning movie recommendations

After you have finished implementing the collaborative filtering cost function and gradient, you can now start training your algorithm to make movie recommendations for yourself. In the next part of the ex8_cofi.m script, you can enter your own movie preferences, so that later when the algorithm runs, you can get your own movie recommendations! We have filled out some values according to our own preferences, but you should change this according to your own tastes. The list of all movies and their number in the dataset can be found listed in the file movie_ids.txt.

2.3.1 Recommendations

After the additional ratings have been added to the dataset, the script will proceed to train the collaborative filtering model. This will learn the parameters X and Theta. To predict the rating of movie i for user j, you need to compute (\theta^{(j)})^T x^{(i)}. The next part of the script computes the ratings for all the movies and users and displays the movies that it recommends (Figure 4), according to ratings that were entered earlier in the script. Note that you might obtain a different set of predictions due to different random initializations.

[Figure 4: Movie recommendations]

Top recommendations for you:
Predicting rating 9.0 for movie Titanic (1997)
Predicting rating 8.9 for movie Star Wars (1977)
Predicting rating 8.8 for movie Shawshank Redemption, The (1994)
Predicting rating 8.5 for movie As Good As It Gets (1997)
Predicting rating 8.5 for movie Good Will Hunting (1997)
Predicting rating 8.5 for movie Usual Suspects, The (1995)
Predicting rating 8.5 for movie Schindler's List (1993)
Predicting rating 8.4 for movie Raiders of the Lost Ark (1981)
Predicting rating 8.4 for movie Empire Strikes Back, The (1980)
Predicting rating 8.4 for movie Braveheart (1995)

Original ratings provided:
Rated 4 for Toy Story (1995)
Rated 3 for Twelve Monkeys (1995)
Rated 5 for Usual Suspects, The (1995)
Rated 4 for Outbreak (1995)
Rated 5 for Shawshank Redemption, The (1994)
Rated 3 for While You Were Sleeping (1995)
Rated 5 for Forrest Gump (1994)
Rated 2 for Silence of the Lambs, The (1991)
Rated 4 for Alien (1979)
Rated 5 for Die Hard 2 (1990)
Rated 5 for Sphere (1998)

Submission and Grading

After completing various parts of the assignment, be sure to use the submit function to submit your solutions to our servers. The following is a breakdown of how each part of this exercise is scored.

Submitted File     | Part                              | Points
estimateGaussian.m | Estimate Gaussian Parameters      | 15 points
selectThreshold.m  | Select Threshold                  | 15 points
cofiCostFunc.m     | Collaborative Filtering Cost      | 20 points
cofiCostFunc.m     | Collaborative Filtering Gradient  | 30 points
cofiCostFunc.m     | Regularized Cost                  | 10 points
cofiCostFunc.m     | Gradient with regularization      | 10 points
                   | Total Points                      | 100 points

You are allowed to submit your solutions multiple times, and we will take only the highest score into consideration. To prevent rapid-fire guessing, the system enforces a minimum of 5 minutes between submissions.

All parts of this programming exercise are due Sunday, December 11th at 23:59:59 PDT.
\ No newline at end of file
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/checkCostFunction.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/checkCostFunction.m
similarity index 100%
rename from Anomaly Detection and Recommender Systems/mlclass-ex8/checkCostFunction.m
rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/checkCostFunction.m
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/cofiCostFunc.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/cofiCostFunc.m
similarity index 100%
rename from Anomaly Detection and Recommender Systems/mlclass-ex8/cofiCostFunc.m
rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/cofiCostFunc.m
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/computeNumericalGradient.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/computeNumericalGradient.m
similarity index 100%
rename from Anomaly Detection and Recommender Systems/mlclass-ex8/computeNumericalGradient.m
rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/computeNumericalGradient.m
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/estimateGaussian.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/estimateGaussian.m
similarity index 100%
rename from Anomaly Detection and Recommender Systems/mlclass-ex8/estimateGaussian.m
rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/estimateGaussian.m
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/ex8.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8.m
similarity index 100%
rename from Anomaly Detection and Recommender Systems/mlclass-ex8/ex8.m
rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8.m
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/ex8_cofi.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8_cofi.m
similarity index 100%
rename from Anomaly Detection and Recommender Systems/mlclass-ex8/ex8_cofi.m
rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8_cofi.m
diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/ex8_movieParams.mat b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8_movieParams.mat
similarity index 100%
rename from Anomaly
Detection and Recommender Systems/mlclass-ex8/ex8_movieParams.mat rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8_movieParams.mat diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/ex8_movies.mat b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8_movies.mat similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/ex8_movies.mat rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8_movies.mat diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/ex8data1.mat b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8data1.mat similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/ex8data1.mat rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8data1.mat diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/ex8data2.mat b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8data2.mat similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/ex8data2.mat rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/ex8data2.mat diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/fmincg.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/fmincg.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/fmincg.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/fmincg.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/loadMovieList.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/loadMovieList.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/loadMovieList.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/loadMovieList.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/movie_ids.txt b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/movie_ids.txt similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/movie_ids.txt rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/movie_ids.txt diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/multivariateGaussian.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/multivariateGaussian.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/multivariateGaussian.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/multivariateGaussian.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/normalizeRatings.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/normalizeRatings.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/normalizeRatings.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/normalizeRatings.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/selectThreshold.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/selectThreshold.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/selectThreshold.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/selectThreshold.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/submit.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/submit.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/submit.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/submit.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/submitWeb.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/submitWeb.m 
similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/submitWeb.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/submitWeb.m diff --git a/Anomaly Detection and Recommender Systems/mlclass-ex8/visualizeFit.m b/AnomalyDetectionandRecommenderSystems/mlclass-ex8/visualizeFit.m similarity index 100% rename from Anomaly Detection and Recommender Systems/mlclass-ex8/visualizeFit.m rename to AnomalyDetectionandRecommenderSystems/mlclass-ex8/visualizeFit.m diff --git a/DecisionTrees &Boosting/.dtree.py.swp b/DecisionTrees &Boosting/.dtree.py.swp deleted file mode 100644 index d28a22e..0000000 Binary files a/DecisionTrees &Boosting/.dtree.py.swp and /dev/null differ diff --git a/DecisionTrees &Boosting/.testdtree.py.swp b/DecisionTrees &Boosting/.testdtree.py.swp deleted file mode 100644 index 7da09f7..0000000 Binary files a/DecisionTrees &Boosting/.testdtree.py.swp and /dev/null differ diff --git a/DecisionTrees &Boosting/dtree.py b/DecisionTrees &Boosting/dtree.py deleted file mode 100755 index d44809b..0000000 --- a/DecisionTrees &Boosting/dtree.py +++ /dev/null @@ -1,722 +0,0 @@ -#!/usr/bin/env python - -""" -dtree.py -- CS181 Assignment 1: Decision Trees - -Implements decision trees, decision stumps, decision tree pruning, and -adaptive boosting. -""" - -import math - -import random - -def log2(dbl): - return math.log(dbl)/math.log(2.0) if dbl > 0.0 else 0.0 - -class Instance(object): - """Describes a piece of data. The features are contained in listAttrs, - the instance label in fLabel, and the instance weight (for use in boosting) - in dblWeight.""" - def __init__(self, listAttrs, fLabel=None, dblWeight=1.0): - self.listAttrs = listAttrs - self.fLabel = fLabel - self.dblWeight = dblWeight - def copy(self): - return Instance(list(self.listAttrs), self.fLabel, self.dblWeight) - def __repr__(self): - """This function is called when you 'print' an instance.""" - if self.dblWeight == 1.0: - return "Instance(%r, %r)" % (self.listAttrs, self.fLabel) - return ("Instance(%r, %r, %.2f)" - % (self.listAttrs, self.fLabel, self.dblWeight)) - -def compute_entropy(dblWeightTrue,dblWeightFalse): - """ Given the total weight of true instances and the total weight - of false instances in a collection, return the entropy of this collection. - >>> compute_entropy(0.0,1000.0) - -0.0 - >>> compute_entropy(0.0001, 0.0) - -0.0 - >>> compute_entropy(1,1) - 1.0""" - - P = 1.0 * dblWeightTrue / (dblWeightTrue + dblWeightFalse) - entropy = -(P * log2(P) + (1 - P) * log2(1 - P)) - - return entropy - - -def separate_by_attribute(listInst, ixAttr): - """Build a dictionary mapping attribute values to lists of instances. - - >>> separate_by_attribute([Instance([5,0],True),Instance([9,0],True)], 0) - {9: [Instance([9, 0], True)], 5: [Instance([5, 0], True)]}""" - - dictInst = {} - for inst in listInst: - # print inst , ixAttr - featureValue = inst.listAttrs[ixAttr] - if featureValue not in dictInst: - dictInst[featureValue] = [] - dictInst[featureValue].append(inst) - - return dictInst - - - -def compute_entropy_of_split(dictInst): - """Compute the average entropy of a mapping of attribute values to lists - of instances. - The average should be weighted by the sum of the weight in each list of - instances. 
- >>> listInst0 = [Instance([],True,0.5), Instance([],False,0.5)] - >>> listInst1 = [Instance([],False,3.0), Instance([],True,0.0)] - >>> dictInst = {0: listInst0, 1: listInst1} - >>> compute_entropy_of_split(dictInst) - 0.25""" - - wTotal = 0 - weightEntropy = 0 - for values in dictInst.values(): - wt = sum(map(lambda inst : inst.dblWeight if inst.fLabel else 0, values)) - wf = sum(map(lambda inst : inst.dblWeight if not inst.fLabel else 0,values)) - w = wt + wf - weightEntropy += w * compute_entropy(wt , wf) - wTotal += w - #print entropy , instNum , posInstNum , negInstNum - - return 1.0 * weightEntropy / wTotal - -def compute_list_entropy(listInst): - return compute_entropy_of_split({None:listInst}) - -def choose_split_attribute(iterableIxAttr, listInst, dblMinGain=0.0): - """Given an iterator over attributes, choose the attribute which - maximimizes the information gain of separating a collection of - instances based on that attribute. - Returns a tuple of (the integer best attribute, a dictionary of the - separated instances). - If the best information gain is less than dblMinGain, then return the - pair (None,None). - >>> listInst = [Instance([0,0],False), Instance([0,1],True)] - >>> choose_split_attribute([0,1], listInst) - (1, {0: [Instance([0, 0], False)], 1: [Instance([0, 1], True)]})""" - - - entropy = compute_list_entropy(listInst) - - infoGainList = [] - for ixAttr in iterableIxAttr: - dictInst = separate_by_attribute(listInst , ixAttr) - expEntropy = compute_entropy_of_split(dictInst) - infoGain = entropy - expEntropy - infoGainList.append((infoGain , ixAttr , dictInst)) - - infoGainList = sorted(infoGainList , reverse = 1) - - #print infoGainList[0][0] - - if infoGainList[0][0] < dblMinGain: - return (None , None) - return (infoGainList[0][1] , infoGainList[0][2]) - - - -def check_for_common_label(listInst): - """Return the boolean label shared by all instances in the given list of - instances, or None if no such label exists - - >>> check_for_common_label([Instance([],True), Instance([],True)]) - True - >>> check_for_common_label([Instance([],False), Instance([],False)]) - False - >>> check_for_common_label([Instance([],True), Instance([],False)])""" - - instNum = len(listInst) - posNum = len([inst for inst in listInst if inst.fLabel == True]) - if posNum == instNum: return True - elif posNum == 0: return False - return None - - - -def majority_label(listInst): - """Return the boolean label with the most weight in the given list of - instances. 
- - >>> majority_label([Instance([],True,1.0),Instance([],False,0.75)]) - True - >>> listInst =[Instance([],False),Instance([],True),Instance([],False)] - >>> majority_label(listInst) - False""" - - posWeight = 0.0 - negWeight = 0.0 - for inst in listInst: - if inst.fLabel == True: posWeight += inst.dblWeight - else: negWeight += inst.dblWeight - - return True if posWeight > negWeight else False - - - -class DTree(object): - def __init__(self, fLabel=None, ixAttr=None, fDefaultLabel=None): - if fLabel is None and ixAttr is None: - raise TypeError("DTree must be given a label or an attribute," - " but received neither.") - self.fLabel = fLabel - self.ixAttr = ixAttr - self.dictChildren = {} - self.fDefaultLabel = fDefaultLabel - if self.is_node() and self.fDefaultLabel is None: - raise TypeError("Nodes require a valid fDefaultLabel") - def is_leaf(self): - return self.fLabel is not None - def is_node(self): - return self.ixAttr is not None - def add(self, dtChild, v): - if not isinstance(dtChild,self.__class__): - raise TypeError("dtChild was not a DTree") - if v in self.dictChildren: - raise ValueError("Attempted to add a child with" - " an existing attribute value.") - self.dictChildren[v] = dtChild - def convert_to_leaf(self): - if self.is_leaf(): - return - self.fLabel = self.fDefaultLabel - self.ixAttr = None - self.fDefaultLabel = None - self.dictChildren = {} - # the following methods are used in testing -- you should need - # to worry about them - def copy(self): - if self.is_leaf(): - return DTree(fLabel=self.fLabel) - dt = DTree(ixAttr=self.ixAttr, fDefaultLabel=self.fDefaultLabel) - for ixValue,dtChild in self.dictChildren.iteritems(): - dt.add(dtChild.copy(),ixValue) - return dt - def _append_repr(self,listRepr): - if self.is_leaf(): - listRepr.append("[%s]" % str(self.fLabel)[0]) - else: - sDefaultLabel = str(self.fDefaultLabel)[0] - listRepr.append("<%d,%s,{" % (self.ixAttr, sDefaultLabel)) - for dtChild in self.dictChildren.values(): - dtChild._append_repr(listRepr) - listRepr.append("}>") - def __repr__(self): - listRepr = [] - self._append_repr(listRepr) - return "".join(listRepr) - -def build_tree_rec(setIxAttr, listInst, dblMinGain, cRemainingLevels): - - """Recursively build a decision tree. - - Given a set of integer attributes, a list of instances, a boolean default - label, and a floating-point valued minimum information gain, create - a decision tree leaf or node. - - If there is a common label across all instances in listInst, the function - returns a leaf node with this common label. - - If setIxAttr is empty, the function returns a leaf with the majority label - across listInst. - - If cRemainingLevels is zero, return the majority label. (If - cRemainingLevels is less than zero, then we don't want to do anything - special -- this is our mechanism for ignoring the tree depth limit). - If no separation of the instances yields an information gain greater than - dblMinGain, the function returns a leaf with the majority label across - listInst. - - Otherwise, the function finds the attribute which maximizes information - gain, splits on the attribute, and continues building the tree - recursively. - - When building tree nodes, the function specifies the majority label across - listInst as the node's default label (fDefaultLabel argument to DTree's - __init__). 
This will be useful in pruning.""" - - - majorityLabel = majority_label(listInst) - if len(setIxAttr) == 0: - return DTree(fLabel = majorityLabel) - if cRemainingLevels == 0: - return DTree(fLabel = majorityLabel) - - commonLabel = check_for_common_label(listInst) - if commonLabel is not None: - return DTree(fLabel = commonLabel) - - ixChosen , dictBest = choose_split_attribute(setIxAttr , listInst , dblMinGain) - if ixChosen is None: - return DTree(fLabel = majorityLabel) - - dt = DTree(ixAttr = ixChosen , fDefaultLabel = majorityLabel) - subsetIxAttr = set(setIxAttr) - set([ixChosen]) - #print subsetIxAttr - for value , attrList in dictBest.items(): - dtChild = build_tree_rec(subsetIxAttr , attrList , dblMinGain , cRemainingLevels - 1) - dt.add(dtChild , value) - - return dt - - - -def count_instance_attributes(listInst): - """Return the number of attributes across all instances, or None if the - instances differ in the number of attributes they contain. - - >>> listInst = [Instance([1,2,3],True), Instance([4,5,6],False)] - >>> count_instance_attributes(listInst) - 3 - >>> count_instance_attributes([Instance([1,2],True),Instance([3],False)]) - """ - countAttr = len(listInst[0].listAttrs) - for inst in listInst: - if countAttr != len(inst.listAttrs): - return None - return countAttr - - - -def build_tree(listInst, dblMinGain=0.0, cMaxLevel=-1): - """Build a decision tree with the ID3 algorithm from a list of - instances.""" - cAttr = count_instance_attributes(listInst) - if cAttr is None: - raise TypeError("Instances provided have attribute lists of " - "varying lengths.") - setIxAttr = set(xrange(cAttr)) - return build_tree_rec(setIxAttr, listInst, dblMinGain, cMaxLevel) - -def classify(dt, inst): - """Using decision tree dt, return the label for instance inst.""" - - if dt.is_leaf(): - return dt.fLabel - value = inst.listAttrs[dt.ixAttr] - if value not in dt.dictChildren: - return dt.fDefaultLabel - return classify(dt.dictChildren[value] , inst) - - - -class EvaluationResult(object): - def __init__(self, listInstCorrect, listInstIncorrect, oClassifier): - self.listInstCorrect = listInstCorrect - self.listInstIncorrect = listInstIncorrect - self.oClassifier = oClassifier - -def weight_correct_incorrect(rslt): - """Return a pair of floating-point numbers denoting the weight of - (correct, incorrect) instances in EvaluationResult rslt. 
- - >>> listInstCorrect = [Instance([],True,0.25)] - >>> listInstIncorrect = [Instance([],False,0.50)] - >>> rslt = EvaluationResult(listInstCorrect, listInstIncorrect, None) - >>> weight_correct_incorrect(rslt) - (0.25, 0.5)""" - - correctInst = sum([inst.dblWeight for inst in rslt.listInstCorrect]) - incorrectInst = sum([inst.dblWeight for inst in rslt.listInstIncorrect]) - return (correctInst , incorrectInst) - - - -class CrossValidationFold(object): - """Abstract base class for all cross validaiton fold types.""" - def build(self): - # abstract method - raise NotImplemented - def classify(self, dt, inst): - # abstract method - raise NotImplemented - def check_insts(self, listInst): - for inst in (listInst or []): - if inst.fLabel is None: - raise TypeError("missing instance label") - return listInst - -class TreeFold(CrossValidationFold): - def __init__(self, listInstTraining, listInstTest, listInstValidate=None): - super(TreeFold,self).__init__() - self.listInstTraining = self.check_insts(listInstTraining) - self.listInstTest = self.check_insts(listInstTest) - self.listInstValidate = self.check_insts(listInstValidate) - self.cMaxLevel = -1 - def build(self): - return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel) - def classify(self, dt, inst): - return classify(dt,inst) - -def evaluate_classification(cvf): - """Given a CrossValidationFold, build a classifier and build an - EvaluationResult that correctly partitions test instances into a list of - correctly and incorrectly classified instances. - - Classifiers can be built using cvf.build(). - Evaluation results are built with - EvaluationResult(listInstCorrect,listInstIncorrect,dt) - where dt is the classifier built with cvf.build().""" - - dt = cvf.build() - listInstCorrect = [] - listInstIncorrect = [] - for inst in cvf.listInstTest: - # print cvf.classify(dt , inst) , inst - if cvf.classify(dt , inst) == inst.fLabel: - listInstCorrect.append(inst) - else: - listInstIncorrect.append(inst) - - return EvaluationResult(listInstCorrect , listInstIncorrect , dt) - - - -def check_folds(listInst, cFold, cMinFold): - """Raise a ValueError if cFold is greater than the number of instances, or - if cFold is less than the minimum number of folds. - -# >>> check_folds([Instance([],True), Instance([],False)], 1, 2) -# >>> check_folds([Instance([],True)], 2, 1) - Traceback (most recent call last): - ... - ValueError: Cannot have more folds than instances -# >>> check_folds([Instance([],False)], 1, 2) - Traceback (most recent call last): - ... - ValueError: Need at least 2 folds.""" - - - if cFold > len(listInst): - raise ValueError("Cannot have more folds than instances") - if cFold < cMinFold: - raise ValueError("'Need at least %d folds' % (cMinFold)") - - return - - -def yield_cv_folds(listInst, cFold): - """Yield a series of TreeFolds, which represent a partition of listInst - into cFold folds. 
- - You may either return a list, or `yield` (http://goo.gl/gwOfM) - TreeFolds one at a time.""" - - check_folds(listInst, cFold, 2) - - listInstSize = len(listInst) - cFoldSize = int(math.ceil(listInstSize / cFold)) - -# folds = [] -# for i in range(cFold): -# if i == cFold - 1: -# folds.append(listInst[i * cFoldSize : listInstSize]) -# else: -# folds.append(listInst[i * cFoldSize : (i + 1) * cFoldSize]) - -# for i in range(cFold): -# listInstTest = folds[i] -# listInstTraining = [] -# for j in range(cFold): -# if i == j: continue -# listInstTraining += folds[j] -# -# #print len(listInstTest) , len(listInstTraining) -# yield TreeFold(listInstTraining , listInstTest) - - for i in range(cFold): - id1 = i * cFoldSize - id2 = min(listInstSize , (i + 1) * cFoldSize) - listInstTest = listInst[id1 : id2] - listInstTraining = listInst[:id1] - listInstTraining.extend(listInst[id2:]) - yield TreeFold(listInstTraining , listInstTest) - - -def cv_score(iterableFolds): - """Determine the fraction (by weight) of correct instances across a number - of cross-validation folds.""" - - correct = 0.0 - incorrect = 0.0 - for cvf in iterableFolds: - result = evaluate_classification(cvf) - correctWeight, incorrectWeight = weight_correct_incorrect(result) - correct += correctWeight - incorrect += incorrectWeight - - return correct / (correct + incorrect) - -def prune_tree(dt, listInst): - """Recursively prune a decision tree. - Given a subtree to prune and a list of instances, - recursively prune the tree, then determine if the current node should - become a leaf. - - The function does not return anything, and instead modifies the tree - in-place.""" - - score = 0.0 - prunedScore = 0.0 - if dt.is_leaf(): return - - dictInst = separate_by_attribute(listInst , dt.ixAttr) - for key , child in dt.dictChildren.items(): - if key not in dictInst: continue - prune_tree(child , dictInst[key]) - - for inst in listInst: - if classify(dt , inst) == inst.fLabel: - score += inst.dblWeight - if dt.fDefaultLabel == inst.fLabel: - prunedScore += inst.dblWeight - - if prunedScore >= score: - dt.convert_to_leaf() - - return - -def build_pruned_tree(listInstTrain, listInstValidate): - - """Build a pruned decision tree from a list of training instances, then - prune the tree using a list of validation instances. - - Return the pruned decision tree.""" - - dt = build_tree(listInstTrain) - prune_tree(dt , listInstValidate) - return dt - -class PrunedFold(TreeFold): - def __init__(self, *args, **kwargs): - super(PrunedFold,self).__init__(*args,**kwargs) - if self.listInstValidate is None: - raise TypeError("PrunedCrossValidationFold requires " - "listInstValidate argument.") - def build(self): - return build_pruned_tree(self.listInstTraining,self.listInstValidate) - -def yield_cv_folds_with_validation(listInst, cFold): - """Yield a number cFold of PrunedFolds, which together form a partition of - the list of instances listInst. 
- - You may either return a list or yield successive values.""" - - check_folds(listInst, cFold, 3) - listInstSize = len(listInst) - cFoldSize = int(math.ceil(listInstSize / cFold)) - #print cFold - for i in range(cFold): - id1 = i * cFoldSize - id2 = min(listInstSize , (i + 1) * cFoldSize) - listInstTest = listInst[id1 : id2] - if id2 == listInstSize: - listInstValidation = listInst[0:cFoldSize] - listInstTraining = listInst[cFoldSize:id1] - else: - id3 = min(listInstSize , id2 + cFoldSize) - listInstValidation = listInst[id2:id3] - listInstTraining = listInst[:id1] - listInstTraining.extend(listInst[id3:]) - yield PrunedFold(listInstTraining , listInstTest , listInstValidation) - - -def normalize_weights(listInst): - """Normalize the weights of all the instances in listInst so that the sum - of their weights totals to 1.0. - - The function modifies the weights of the instances in-place and does - not return anything. - - >>> listInst = [Instance([],True,0.1), Instance([],False,0.3)] - >>> normalize_weights(listInst) - >>> print listInst - [Instance([], True, 0.25), Instance([], False, 0.75)]""" - - wTotal = sum(map(lambda inst : inst.dblWeight , listInst)) - - for inst in listInst: - inst.dblWeight /= wTotal - -def init_weights(listInst): - """Initialize the weights of the instances in listInst so that each - instance has weight 1/(number of instances). This function modifies - the weights in place and does not return anything. - - >>> listInst = [Instance([],True,0.5), Instance([],True,0.25)] - >>> init_weights(listInst) - >>> print listInst - [Instance([], True, 0.50), Instance([], True, 0.50)]""" - - nTotal = len(listInst) - for inst in listInst: - inst.dblWeight = 1.0 / nTotal - return - -def classifier_error(rslt): - """Given and evaluation result, return the (floating-point) fraction - of correct instances by weight. - - >>> listInstCorrect = [Instance([],True,0.15)] - >>> listInstIncorrect = [Instance([],True,0.45)] - >>> rslt = EvaluationResult(listInstCorrect,listInstIncorrect,None) - >>> classifier_error(rslt) - 0.75""" - - correctWeights = sum(map(lambda inst : inst.dblWeight , rslt.listInstCorrect)) - inCorrectWeights = sum(map(lambda inst : inst.dblWeight , rslt.listInstIncorrect)) - return 1.0 * inCorrectWeights / (inCorrectWeights + correctWeights) - - - - -def classifier_weight(dblError): - """Return the classifier weight alpha from the classifier's training - error.""" - - return 0.5 * math.log((1 - dblError) / dblError) - - -def update_weight_unnormalized(inst, dblClassifierWeight, fClassifiedLabel): - """Re-weight an instance given the classifier weight, and the label - assigned to the instance by the classifier. This function acts in place - and does not return anything.""" - - if inst.fLabel != fClassifiedLabel: - inst.dblWeight *= math.pow(math.e , dblClassifierWeight) - else: - inst.dblWeight *= math.pow(math.e , -dblClassifierWeight) - - -class StumpFold(TreeFold): - def __init__(self, listInstTraining, cMaxLevel=1): - self.listInstTraining = listInstTraining - self.listInstTest = listInstTraining - self.cMaxLevel = cMaxLevel - def build(self): - return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel) - -def one_round_boost(listInst, cMaxLevel): - """Conduct a single round of boosting on a list of instances. Returns a - triple (classifier, error, classifier weight). 
- - Implementation suggestion: - - build a StumpFold from the list of instances and the given - cMaxLevel (it's obnoxious that cMaxLevel has to be passed around - like this -- just pass it into Stumpfold() as the second argument - and you should be fine). - - using the StumpFold, build an EvaluationResult using - evaluate_classification - - get the error rate of the EvaluationResult using classifier_error - - obtain the classifier weight from the classifier error - - update the weight of all instances in the evaluation results - - normalize all weights - - return the EvaluationResult's oClassifier member, the classifier error, - and the classifier weight in a 3-tuple - - remember to return early if the error is zero.""" - - stump = StumpFold(listInst , cMaxLevel = cMaxLevel) - result = evaluate_classification(stump) - error = classifier_error(result) - if error == 0: - return result.oClassifier , 0 , 1 - classifierWeight = classifier_weight(error) - for inst in listInst: - update_weight_unnormalized(inst , classifierWeight , classify(result.oClassifier , inst)) - - normalize_weights(listInst) - - return (result.oClassifier , error , classifierWeight) - - -class BoostResult(object): - def __init__(self, listDblCferWeight, listCfer): - self.listDblCferWeight = listDblCferWeight - self.listCfer = listCfer - -def boost(listInst, cMaxRounds=50, cMaxLevel=1): - """Conduct up to cMaxRounds of boosting on training instances listInst - and return a BoostResult containing the classifiers and their weights.""" - - listCfer = [] - listDblCferWeight = [] - for iterRound in range(cMaxRounds): - (classifier , error , classifierWeight) = one_round_boost(listInst , cMaxLevel) - listCfer.append(classifier) - listDblCferWeight.append(classifierWeight) - - return BoostResult(listDblCferWeight , listCfer) - -def classify_boosted(br,inst): - """Given a BoostResult and an instance, return the (boolean) label - predicted for the instance by the boosted classifier.""" - - res = 0 - for i in range(len(br.listCfer)): - fClassifiedLabel = classify(br.listCfer[i] , inst) - 0.5 - res += fClassifiedLabel * br.listDblCferWeight[i] - - return True if res >= 0 else False - - -class BoostedFold(TreeFold): - def __init__(self, *args, **kwargs): - super(BoostedFold,self).__init__(*args, **kwargs) - self.cMaxLevel = 1 - self.cMaxRounds = 50 - def build(self): - listInst = [inst.copy() for inst in self.listInstTraining] - return boost(listInst, self.cMaxRounds, self.cMaxLevel) - def classify(self, br, inst): - return classify_boosted(br, inst) - -def yield_boosted_folds(listInst, cFold): - """Yield a number cFold of BoostedFolds, constituting a partition of - listInst. 
- - Implementation suggestion: Generate TreeFolds, and yield BoostedFolds - built from your TreeFolds.""" - boostedFolds = [] - folds = yield_cv_folds(listInst , cFold) - for fold in folds: - boostedFolds.append(BoostedFold(fold.listInstTraining , fold.listInstTest)) - - return boostedFolds - - -def read_csv_dataset(infile): - listInst = [] - for sRow in infile: - listRow = map(int, sRow.strip().split()) - inst = Instance(map(int,listRow[:-1]), bool(listRow[-1])) - listInst.append(inst) - return listInst - -def load_csv_dataset(oFile): - if isinstance(oFile,basestring): - with open(oFile) as infile: return read_csv_dataset(infile) - return read_csv_dataset(infile) - -def main(argv): - import doctest - doctest.testmod() - listInst = load_csv_dataset("data.csv") - cFold = 10 - iterableFolds = yield_cv_folds_with_validation(listInst,cFold) - #iterableFolds = yield_cv_folds(listInst,cFold) - #iterableFolds = yield_boosted_folds(listInst,cFold) - print "%.2f%% correct" % (100.0*cv_score(iterableFolds)) - return 0 - - - -if __name__ == "__main__": - import doctest - doctest.testmod() diff --git a/DecisionTrees &Boosting/dtree.pyc b/DecisionTrees &Boosting/dtree.pyc deleted file mode 100644 index 4e721b2..0000000 Binary files a/DecisionTrees &Boosting/dtree.pyc and /dev/null differ diff --git a/DecisionTrees &Boosting/dtree1.py b/DecisionTrees &Boosting/dtree1.py deleted file mode 100755 index 6406220..0000000 --- a/DecisionTrees &Boosting/dtree1.py +++ /dev/null @@ -1,694 +0,0 @@ -#!/usr/bin/env python - -""" -dtree.py -- CS181 Assignment 1: Decision Trees - -Implements decision trees, decision stumps, decision tree pruning, and -adaptive boosting. -""" - -import math - -import random - -def log2(dbl): - return math.log(dbl)/math.log(2.0) if dbl > 0.0 else 0.0 - -class Instance(object): - """Describes a piece of data. The features are contained in listAttrs, - the instance label in fLabel, and the instance weight (for use in boosting) - in dblWeight.""" - def __init__(self, listAttrs, fLabel=None, dblWeight=1.0): - self.listAttrs = listAttrs - self.fLabel = fLabel - self.dblWeight = dblWeight - def copy(self): - return Instance(list(self.listAttrs), self.fLabel, self.dblWeight) - def __repr__(self): - """This function is called when you 'print' an instance.""" - if self.dblWeight == 1.0: - return "Instance(%r, %r)" % (self.listAttrs, self.fLabel) - return ("Instance(%r, %r, %.2f)" - % (self.listAttrs, self.fLabel, self.dblWeight)) - -def compute_entropy(dblWeightTrue,dblWeightFalse): - """ Given the total weight of true instances and the total weight - of false instances in a collection, return the entropy of this collection. - >>> compute_entropy(0.0,1000.0) - -0.0 - >>> compute_entropy(0.0001, 0.0) - -0.0 - >>> compute_entropy(1,1) - 1.0""" - - P = 1.0 * dblWeightTrue / (dblWeightTrue + dblWeightFalse) - entropy = -(P * log2(P) + (1 - P) * log2(1 - P)) - - return entropy - - -def separate_by_attribute(listInst, ixAttr): - """Build a dictionary mapping attribute values to lists of instances. - - >>> separate_by_attribute([Instance([5,0],True),Instance([9,0],True)], 0) - {9: [Instance([9, 0], True)], 5: [Instance([5, 0], True)]}""" - - dictInst = {} - for inst in listInst: - # print inst , ixAttr - featureValue = inst.listAttrs[ixAttr] - if featureValue not in dictInst: - dictInst[featureValue] = [] - dictInst[featureValue].append(inst) - - return dictInst - - - -def compute_entropy_of_split(dictInst): - """Compute the average entropy of a mapping of attribute values to lists - of instances. 
- The average should be weighted by the sum of the weight in each list of - instances. - >>> listInst0 = [Instance([],True,0.5), Instance([],False,0.5)] - >>> listInst1 = [Instance([],False,3.0), Instance([],True,0.0)] - >>> dictInst = {0: listInst0, 1: listInst1} - >>> compute_entropy_of_split(dictInst) - 0.25""" - - wTotal = 0 - weightEntropy = 0 - for values in dictInst.values(): - wt = sum(map(lambda inst : inst.dblWeight if inst.fLabel else 0, values)) - wf = sum(map(lambda inst : inst.dblWeight if not inst.fLabel else 0,values)) - w = wt + wf - weightEntropy += w * compute_entropy(wt , wf) - wTotal += w - #print entropy , instNum , posInstNum , negInstNum - - return 1.0 * weightEntropy / wTotal - -def compute_list_entropy(listInst): - return compute_entropy_of_split({None:listInst}) - -def choose_split_attribute(iterableIxAttr, listInst, dblMinGain=0.0): - """Given an iterator over attributes, choose the attribute which - maximimizes the information gain of separating a collection of - instances based on that attribute. - Returns a tuple of (the integer best attribute, a dictionary of the - separated instances). - If the best information gain is less than dblMinGain, then return the - pair (None,None). - >>> listInst = [Instance([0,0],False), Instance([0,1],True)] - >>> choose_split_attribute([0,1], listInst) - (1, {0: [Instance([0, 0], False)], 1: [Instance([0, 1], True)]})""" - - - entropy = compute_list_entropy(listInst) - - infoGainList = [] - for ixAttr in iterableIxAttr: - dictInst = separate_by_attribute(listInst , ixAttr) - expEntropy = compute_entropy_of_split(dictInst) - infoGain = entropy - expEntropy - infoGainList.append((infoGain , ixAttr , dictInst)) - - infoGainList = sorted(infoGainList , reverse = 1) - - #print infoGainList[0][0] - - if infoGainList[0][0] < dblMinGain: - return (None , None) - return (infoGainList[0][1] , infoGainList[0][2]) - - - -def check_for_common_label(listInst): - """Return the boolean label shared by all instances in the given list of - instances, or None if no such label exists - - >>> check_for_common_label([Instance([],True), Instance([],True)]) - True - >>> check_for_common_label([Instance([],False), Instance([],False)]) - False - >>> check_for_common_label([Instance([],True), Instance([],False)])""" - - instNum = len(listInst) - posNum = len([inst for inst in listInst if inst.fLabel == True]) - if posNum == instNum: return True - elif posNum == 0: return False - return None - - - -def majority_label(listInst): - """Return the boolean label with the most weight in the given list of - instances. 
- - >>> majority_label([Instance([],True,1.0),Instance([],False,0.75)]) - True - >>> listInst =[Instance([],False),Instance([],True),Instance([],False)] - >>> majority_label(listInst) - False""" - - posWeight = 0.0 - negWeight = 0.0 - for inst in listInst: - if inst.fLabel == True: posWeight += inst.dblWeight - else: negWeight += inst.dblWeight - - return True if posWeight >= negWeight else False - - - -class DTree(object): - def __init__(self, fLabel=None, ixAttr=None, fDefaultLabel=None): - if fLabel is None and ixAttr is None: - raise TypeError("DTree must be given a label or an attribute," - " but received neither.") - self.fLabel = fLabel - self.ixAttr = ixAttr - self.dictChildren = {} - self.fDefaultLabel = fDefaultLabel - if self.is_node() and self.fDefaultLabel is None: - raise TypeError("Nodes require a valid fDefaultLabel") - def is_leaf(self): - return self.fLabel is not None - def is_node(self): - return self.ixAttr is not None - def add(self, dtChild, v): - if not isinstance(dtChild,self.__class__): - raise TypeError("dtChild was not a DTree") - if v in self.dictChildren: - raise ValueError("Attempted to add a child with" - " an existing attribute value.") - self.dictChildren[v] = dtChild - def convert_to_leaf(self): - if self.is_leaf(): - return - self.fLabel = self.fDefaultLabel - self.ixAttr = None - self.fDefaultLabel = None - self.dictChildren = {} - # the following methods are used in testing -- you should need - # to worry about them - def copy(self): - if self.is_leaf(): - return DTree(fLabel=self.fLabel) - dt = DTree(ixAttr=self.ixAttr, fDefaultLabel=self.fDefaultLabel) - for ixValue,dtChild in self.dictChildren.iteritems(): - dt.add(dtChild.copy(),ixValue) - return dt - def _append_repr(self,listRepr): - if self.is_leaf(): - listRepr.append("[%s]" % str(self.fLabel)[0]) - else: - sDefaultLabel = str(self.fDefaultLabel)[0] - listRepr.append("<%d,%s,{" % (self.ixAttr, sDefaultLabel)) - for dtChild in self.dictChildren.values(): - dtChild._append_repr(listRepr) - listRepr.append("}>") - def __repr__(self): - listRepr = [] - self._append_repr(listRepr) - return "".join(listRepr) - -def build_tree_rec(setIxAttr, listInst, dblMinGain, cRemainingLevels): - - """Recursively build a decision tree. - - Given a set of integer attributes, a list of instances, a boolean default - label, and a floating-point valued minimum information gain, create - a decision tree leaf or node. - - If there is a common label across all instances in listInst, the function - returns a leaf node with this common label. - - If setIxAttr is empty, the function returns a leaf with the majority label - across listInst. - - If cRemainingLevels is zero, return the majority label. (If - cRemainingLevels is less than zero, then we don't want to do anything - special -- this is our mechanism for ignoring the tree depth limit). - If no separation of the instances yields an information gain greater than - dblMinGain, the function returns a leaf with the majority label across - listInst. - - Otherwise, the function finds the attribute which maximizes information - gain, splits on the attribute, and continues building the tree - recursively. - - When building tree nodes, the function specifies the majority label across - listInst as the node's default label (fDefaultLabel argument to DTree's - __init__). 
This will be useful in pruning.""" - - - majorityLabel = majority_label(listInst) - if len(setIxAttr) == 0: - return DTree(fLabel = majorityLabel) - if cRemainingLevels == 0: - return DTree(fLabel = majorityLabel) - - commonLabel = check_for_common_label(listInst) - if commonLabel is not None: - return DTree(fLabel = commonLabel) - - ixChosen , dictBest = choose_split_attribute(setIxAttr , listInst , dblMinGain) - if ixChosen is None: - return DTree(fLabel = majorityLabel) - - dt = DTree(ixAttr = ixChosen , fDefaultLabel = majorityLabel) - subsetIxAttr = set(setIxAttr) - set([ixChosen]) - #print subsetIxAttr - for value , attrList in dictBest.items(): - dtChild = build_tree_rec(subsetIxAttr , attrList , dblMinGain , cRemainingLevels - 1) - dt.add(dtChild , value) - - return dt - - - -def count_instance_attributes(listInst): - """Return the number of attributes across all instances, or None if the - instances differ in the number of attributes they contain. - - >>> listInst = [Instance([1,2,3],True), Instance([4,5,6],False)] - >>> count_instance_attributes(listInst) - 3 - >>> count_instance_attributes([Instance([1,2],True),Instance([3],False)]) - """ - countAttr = len(listInst[0].listAttrs) - for inst in listInst: - if countAttr != len(inst.listAttrs): - return None - return countAttr - - - -def build_tree(listInst, dblMinGain=0.0, cMaxLevel=-1): - """Build a decision tree with the ID3 algorithm from a list of - instances.""" - cAttr = count_instance_attributes(listInst) - if cAttr is None: - raise TypeError("Instances provided have attribute lists of " - "varying lengths.") - setIxAttr = set(xrange(cAttr)) - return build_tree_rec(setIxAttr, listInst, dblMinGain, cMaxLevel) - -def classify(dt, inst): - """Using decision tree dt, return the label for instance inst.""" - - if dt.is_leaf(): - return dt.fLabel - value = inst.listAttrs[dt.ixAttr] - if value not in dt.dictChildren: - return dt.fDefaultLabel - return classify(dt.dictChildren[value] , inst) - - - -class EvaluationResult(object): - def __init__(self, listInstCorrect, listInstIncorrect, oClassifier): - self.listInstCorrect = listInstCorrect - self.listInstIncorrect = listInstIncorrect - self.oClassifier = oClassifier - -def weight_correct_incorrect(rslt): - """Return a pair of floating-point numbers denoting the weight of - (correct, incorrect) instances in EvaluationResult rslt. 
- - >>> listInstCorrect = [Instance([],True,0.25)] - >>> listInstIncorrect = [Instance([],False,0.50)] - >>> rslt = EvaluationResult(listInstCorrect, listInstIncorrect, None) - >>> weight_correct_incorrect(rslt) - (0.25, 0.5)""" - - correctInst = sum([inst.dblWeight for inst in rslt.listInstCorrect]) - incorrectInst = sum([inst.dblWeight for inst in rslt.listInstIncorrect]) - return (correctInst , incorrectInst) - - - -class CrossValidationFold(object): - """Abstract base class for all cross validaiton fold types.""" - def build(self): - # abstract method - raise NotImplemented - def classify(self, dt, inst): - # abstract method - raise NotImplemented - def check_insts(self, listInst): - for inst in (listInst or []): - if inst.fLabel is None: - raise TypeError("missing instance label") - return listInst - -class TreeFold(CrossValidationFold): - def __init__(self, listInstTraining, listInstTest, listInstValidate=None): - super(TreeFold,self).__init__() - self.listInstTraining = self.check_insts(listInstTraining) - self.listInstTest = self.check_insts(listInstTest) - self.listInstValidate = self.check_insts(listInstValidate) - self.cMaxLevel = -1 - def build(self): - return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel) - def classify(self, dt, inst): - return classify(dt,inst) - -def evaluate_classification(cvf): - """Given a CrossValidationFold, build a classifier and build an - EvaluationResult that correctly partitions test instances into a list of - correctly and incorrectly classified instances. - - Classifiers can be built using cvf.build(). - Evaluation results are built with - EvaluationResult(listInstCorrect,listInstIncorrect,dt) - where dt is the classifier built with cvf.build().""" - - dt = cvf.build() - listInstCorrect = [] - listInstIncorrect = [] - for inst in cvf.listInstTest: - if inst.fLabel == cvf.classify(dt , inst): - listInstCorrect.append(inst) - else: - listInstIncorrect.append(inst) - - return EvaluationResult(listInstCorrect , listInstIncorrect , dt) - - - -def check_folds(listInst, cFold, cMinFold): - """Raise a ValueError if cFold is greater than the number of instances, or - if cFold is less than the minimum number of folds. - - >>> check_folds([Instance([],True), Instance([],False)], 1, 2) - >>> check_folds([Instance([],True)], 2, 1) - Traceback (most recent call last): - ... - ValueError: Cannot have more folds than instances - >>> check_folds([Instance([],False)], 1, 2) - Traceback (most recent call last): - ... - ValueError: Need at least 2 folds.""" - - if cFold > len(listInst): - raise ValueError("Cannot have more folds than instances") - if cFold < cMinFold: - raise ValueError("'Need at least %d folds' % (cMinFold)") - - return - - -def yield_cv_folds(listInst, cFold): - """Yield a series of TreeFolds, which represent a partition of listInst - into cFold folds. 
- - You may either return a list, or `yield` (http://goo.gl/gwOfM) - TreeFolds one at a time.""" - - check_folds(listInst, cFold, 2) - - listInstSize = len(listInst) - cFoldSize = int(math.ceil(listInstSize / cFold)) - -# folds = [] -# for i in range(cFold): -# if i == cFold - 1: -# folds.append(listInst[i * cFoldSize : listInstSize]) -# else: -# folds.append(listInst[i * cFoldSize : (i + 1) * cFoldSize]) - -# for i in range(cFold): -# listInstTest = folds[i] -# listInstTraining = [] -# for j in range(cFold): -# if i == j: continue -# listInstTraining += folds[j] -# -# #print len(listInstTest) , len(listInstTraining) -# yield TreeFold(listInstTraining , listInstTest) - - for i in range(cFold): - id1 = i * cFoldSize - id2 = min(listInstSize , (i + 1) * cFoldSize) - listInstTest = listInst[id1 : id2] - listInstTraining = listInst[:id1] - listInstTraining.extend(listInst[id2:]) - yield TreeFold(listInstTraining , listInstTest) - - - - #raise NotImplementedError - -def cv_score(iterableFolds): - """Determine the fraction (by weight) of correct instances across a number - of cross-validation folds.""" - - correct = 0.0 - incorrect = 0.0 - for cvf in iterableFolds: - result = evaluate_classification(cvf) - correctWeight, incorrectWeight = weight_correct_incorrect(result) - - for inst in result.listInstCorrect: - print inst.fLabel, - for inst in result.listInstIncorrect: - print inst.fLabel, - print '\n' - print '-----------------------------------------------' - - #print len(result.listInstCorrect) , len(result.listInstIncorrect) - #print '----------------------------------------------' - #print correctWeight , incorrectWeight - #print '-------------------------------------------' - correct += correctWeight - incorrect += incorrectWeight - #return - - return correct / (correct + incorrect) - - raise NotImplementedError - -def prune_tree(dt, listInst): - """Recursively prune a decision tree. - Given a subtree to prune and a list of instances, - recursively prune the tree, then determine if the current node should - become a leaf. - - The function does not return anything, and instead modifies the tree - in-place.""" - - score = 0.0 - prunedScore = 0.0 - if dt.is_leaf(): return - - dictInst = separate_by_attribute(listInst , dt.ixAttr) - for key , child in dt.dictChildren.items(): - if key not in dictInst: continue - prune_tree(child , dictInst[key]) - - for inst in listInst: - if classify(dt , inst) == inst.fLabel: - score += inst.dblWeight - if dt.fDefaultLabel == inst.fLabel: - prunedScore += inst.dblWeight - - if prunedScore >= score: - dt.convert_to_leaf() - - return - -def build_pruned_tree(listInstTrain, listInstValidate): - - """Build a pruned decision tree from a list of training instances, then - prune the tree using a list of validation instances. - - Return the pruned decision tree.""" - - dt = build_tree(listInstTrain) - pruned_tree(dt , listInstValidate) - return dt - -class PrunedFold(TreeFold): - def __init__(self, *args, **kwargs): - super(PrunedFold,self).__init__(*args,**kwargs) - if self.listInstValidate is None: - raise TypeError("PrunedCrossValidationFold requires " - "listInstValidate argument.") - def build(self): - return build_pruned_tree(self.listInstTraining,self.listInstValidate) - -def yield_cv_folds_with_validation(listInst, cFold): - """Yield a number cFold of PrunedFolds, which together form a partition of - the list of instances listInst. 
- - You may either return a list or yield successive values.""" - - listInstSize = len(listInst) - cFoldSize = listInstSize / cFold - - folds = [] - for i in range(cFold): - if i == cFold - 1: - folds.append(listInst[i * cFoldSize : listInstSize]) - else: - folds.append(listInst[i * cFoldSize : (i + 1) * cFoldSize]) - - for i in range(cFold - 1): - listInstTest = folds[i] - listInstValidation = folds[i + 1] - listInstTraining = [] - for j in range(cFold): - if i == j or i + 1 == j: continue - listInstTraining += folds[j] - - yield TreeFold(listInstTraining , listInstTest , listInstValidation) - - #raise NotImplementedError - -def normalize_weights(listInst): - """Normalize the weights of all the instances in listInst so that the sum - of their weights totals to 1.0. - - The function modifies the weights of the instances in-place and does - not return anything. - - >>> listInst = [Instance([],True,0.1), Instance([],False,0.3)] - >>> normalize_weights(listInst) - >>> print listInst - [Instance([], True, 0.25), Instance([], False, 0.75)]""" - - wTotal = sum(map(lambda inst : inst.dblWeight , listInst)) - - for inst in listInst: - inst.dblWeight /= wTotal - -def init_weights(listInst): - """Initialize the weights of the instances in listInst so that each - instance has weight 1/(number of instances). This function modifies - the weights in place and does not return anything. - - >>> listInst = [Instance([],True,0.5), Instance([],True,0.25)] - >>> init_weights(listInst) - >>> print listInst - [Instance([], True, 0.50), Instance([], True, 0.50)]""" - - nTotal = len(listInst) - for inst in listInst: - inst.dblWeight = 1.0 / nTotal - return - -def classifier_error(rslt): - """Given and evaluation result, return the (floating-point) fraction - of correct instances by weight. - - >>> listInstCorrect = [Instance([],True,0.15)] - >>> listInstIncorrect = [Instance([],True,0.45)] - >>> rslt = EvaluationResult(listInstCorrect,listInstIncorrect,None) - >>> classifier_error(rslt) - 0.75""" - raise NotImplementedError - -def classifier_weight(dblError): - """Return the classifier weight alpha from the classifier's training - error.""" - raise NotImplementedError - -def update_weight_unnormalized(inst, dblClassifierWeight, fClassifiedLabel): - """Re-weight an instance given the classifier weight, and the label - assigned to the instance by the classifier. This function acts in place - and does not return anything.""" - raise NotImplementedError - -class StumpFold(TreeFold): - def __init__(self, listInstTraining, cMaxLevel=1): - self.listInstTraining = listInstTraining - self.listInstTest = listInstTraining - self.cMaxLevel = cMaxLevel - def build(self): - return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel) - -def one_round_boost(listInst, cMaxLevel): - """Conduct a single round of boosting on a list of instances. Returns a - triple (classifier, error, classifier weight). - - Implementation suggestion: - - build a StumpFold from the list of instances and the given - cMaxLevel (it's obnoxious that cMaxLevel has to be passed around - like this -- just pass it into Stumpfold() as the second argument - and you should be fine). 
- - using the StumpFold, build an EvaluationResult using - evaluate_classification - - get the error rate of the EvaluationResult using classifier_error - - obtain the classifier weight from the classifier error - - update the weight of all instances in the evaluation results - - normalize all weights - - return the EvaluationResult's oClassifier member, the classifier error, - and the classifier weight in a 3-tuple - - remember to return early if the error is zero.""" - raise NotImplementedError - -class BoostResult(object): - def __init__(self, listDblCferWeight, listCfer): - self.listDblCferWeight = listDblCferWeight - self.listCfer = listCfer - -def boost(listInst, cMaxRounds=50, cMaxLevel=1): - """Conduct up to cMaxRounds of boosting on training instances listInst - and return a BoostResult containing the classifiers and their weights.""" - raise NotImplementedError - -def classify_boosted(br,inst): - """Given a BoostResult and an instance, return the (boolean) label - predicted for the instance by the boosted classifier.""" - raise NotImplementedError - -class BoostedFold(TreeFold): - def __init__(self, *args, **kwargs): - super(BoostedFold,self).__init__(*args, **kwargs) - self.cMaxLevel = 1 - self.cMaxRounds = 50 - def build(self): - listInst = [inst.copy() for inst in self.listInstTraining] - return boost(listInst, self.cMaxRounds, self.cMaxLevel) - def classify(self, br, inst): - return classify_boosted(br, inst) - -def yield_boosted_folds(listInst, cFold): - """Yield a number cFold of BoostedFolds, constituting a partition of - listInst. - - Implementation suggestion: Generate TreeFolds, and yield BoostedFolds - built from your TreeFolds.""" - raise NotImplementedError - -def read_csv_dataset(infile): - listInst = [] - for sRow in infile: - listRow = map(int, sRow.strip().split()) - inst = Instance(map(int,listRow[:-1]), bool(listRow[-1])) - listInst.append(inst) - return listInst - -def load_csv_dataset(oFile): - if isinstance(oFile,basestring): - with open(oFile) as infile: return read_csv_dataset(infile) - return read_csv_dataset(infile) - -def main(argv): - import doctest - doctest.testmod() - listInst = load_csv_dataset("data.csv") - cFold = 10 - iterableFolds = yield_cv_folds_with_validation(listInst,cFold) - #iterableFolds = yield_cv_folds(listInst,cFold) - #iterableFolds = yield_boosted_folds(listInst,cFold) - print "%.2f%% correct" % (100.0*cv_score(iterableFolds)) - return 0 - - - -if __name__ == "__main__": - import doctest - doctest.testmod() diff --git a/DecisionTrees &Boosting/dttasks.pyc b/DecisionTrees &Boosting/dttasks.pyc deleted file mode 100644 index e2b2021..0000000 Binary files a/DecisionTrees &Boosting/dttasks.pyc and /dev/null differ diff --git a/DecisionTrees &Boosting/testdtree.pyc b/DecisionTrees &Boosting/testdtree.pyc deleted file mode 100644 index 8b6c04a..0000000 Binary files a/DecisionTrees &Boosting/testdtree.pyc and /dev/null differ diff --git a/DecisionTrees &Boosting/Makefile b/DecisionTreesAndBoosting/Makefile similarity index 100% rename from DecisionTrees &Boosting/Makefile rename to DecisionTreesAndBoosting/Makefile diff --git a/DecisionTrees &Boosting/breast-cancer-wisconsin.names b/DecisionTreesAndBoosting/breast-cancer-wisconsin.names similarity index 100% rename from DecisionTrees &Boosting/breast-cancer-wisconsin.names rename to DecisionTreesAndBoosting/breast-cancer-wisconsin.names diff --git a/DecisionTrees &Boosting/build.log b/DecisionTreesAndBoosting/build.log similarity index 100% rename from DecisionTrees &Boosting/build.log 
rename to DecisionTreesAndBoosting/build.log diff --git a/DecisionTrees &Boosting/call_graph.png b/DecisionTreesAndBoosting/call_graph.png similarity index 100% rename from DecisionTrees &Boosting/call_graph.png rename to DecisionTreesAndBoosting/call_graph.png diff --git a/DecisionTrees &Boosting/data.csv b/DecisionTreesAndBoosting/data.csv similarity index 100% rename from DecisionTrees &Boosting/data.csv rename to DecisionTreesAndBoosting/data.csv diff --git a/DecisionTreesAndBoosting/dtree.py b/DecisionTreesAndBoosting/dtree.py new file mode 100755 index 0000000..3cd34b7 --- /dev/null +++ b/DecisionTreesAndBoosting/dtree.py @@ -0,0 +1,782 @@ +#!/usr/bin/env python + +""" +dtree.py -- CS181 Assignment 1: Decision Trees + +Implements decision trees, decision stumps, decision tree pruning, and +adaptive boosting. +""" + +import math + + +def log2(dbl): + return math.log(dbl) / math.log(2.0) if dbl > 0.0 else 0.0 + + +class Instance(object): + + """Describes a piece of data. The features are contained in listAttrs, + the instance label in fLabel, and the instance weight (for use in boosting) + in dblWeight.""" + + def __init__(self, listAttrs, fLabel=None, dblWeight=1.0): + self.listAttrs = listAttrs + self.fLabel = fLabel + self.dblWeight = dblWeight + + def copy(self): + return Instance(list(self.listAttrs), self.fLabel, self.dblWeight) + + def __repr__(self): + """This function is called when you 'print' an instance.""" + if self.dblWeight == 1.0: + return "Instance(%r, %r)" % (self.listAttrs, self.fLabel) + return ("Instance(%r, %r, %.2f)" + % (self.listAttrs, self.fLabel, self.dblWeight)) + + +def compute_entropy(dblWeightTrue, dblWeightFalse): + """ Given the total weight of true instances and the total weight + of false instances in a collection, + return the entropy of this collection. + >>> compute_entropy(0.0,1000.0) + -0.0 + >>> compute_entropy(0.0001, 0.0) + -0.0 + >>> compute_entropy(1,1) + 1.0""" + + P = 1.0 * dblWeightTrue / (dblWeightTrue + dblWeightFalse) + entropy = -(P * log2(P) + (1 - P) * log2(1 - P)) + + return entropy + + +def separate_by_attribute(listInst, ixAttr): + """Build a dictionary mapping attribute values to lists of instances. + + >>> separate_by_attribute([Instance([5,0],True),Instance([9,0],True)], 0) + {9: [Instance([9, 0], True)], 5: [Instance([5, 0], True)]}""" + + dictInst = {} + for inst in listInst: + # print inst , ixAttr + featureValue = inst.listAttrs[ixAttr] + if featureValue not in dictInst: + dictInst[featureValue] = [] + dictInst[featureValue].append(inst) + + return dictInst + + +def compute_entropy_of_split(dictInst): + """Compute the average entropy of a mapping of attribute values to lists + of instances. + The average should be weighted by the sum of the weight in each list of + instances. 
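+    That is, if list k has total weight w_k and entropy H_k, the average is
+    sum_k(w_k * H_k) / sum_k(w_k). In the example below, list 0 has total
+    weight 1.0 and entropy 1.0, while list 1 has total weight 3.0 and
+    entropy 0.0, so the result is (1.0 * 1.0 + 3.0 * 0.0) / 4.0 = 0.25: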
+
+    >>> listInst0 = [Instance([],True,0.5), Instance([],False,0.5)]
+    >>> listInst1 = [Instance([],False,3.0), Instance([],True,0.0)]
+    >>> dictInst = {0: listInst0, 1: listInst1}
+    >>> compute_entropy_of_split(dictInst)
+    0.25"""
+
+    wTotal = 0
+    weightEntropy = 0
+    for values in dictInst.values():
+        wt = sum(
+            map(lambda inst: inst.dblWeight if inst.fLabel else 0, values))
+        wf = sum(
+            map(lambda inst: inst.dblWeight if not inst.fLabel else 0, values))
+        w = wt + wf
+        weightEntropy += w * compute_entropy(wt, wf)
+        wTotal += w
+    # print entropy , instNum , posInstNum , negInstNum
+
+    return 1.0 * weightEntropy / wTotal
+
+
+def compute_list_entropy(listInst):
+    return compute_entropy_of_split({None: listInst})
+
+
+def choose_split_attribute(iterableIxAttr, listInst, dblMinGain=0.0):
+    """Given an iterator over attributes, choose the attribute which
+    maximizes the information gain of separating a collection of
+    instances based on that attribute.
+    Returns a tuple of (the integer best attribute, a dictionary of the
+    separated instances).
+    If the best information gain is less than dblMinGain, then return the
+    pair (None,None).
+    >>> listInst = [Instance([0,0],False), Instance([0,1],True)]
+    >>> choose_split_attribute([0,1], listInst)
+    (1, {0: [Instance([0, 0], False)], 1: [Instance([0, 1], True)]})"""
+
+    entropy = compute_list_entropy(listInst)
+
+    infoGainList = []
+    for ixAttr in iterableIxAttr:
+        dictInst = separate_by_attribute(listInst, ixAttr)
+        expEntropy = compute_entropy_of_split(dictInst)
+        infoGain = entropy - expEntropy
+        infoGainList.append((infoGain, ixAttr, dictInst))
+
+    infoGainList = sorted(infoGainList, reverse=True)
+
+    # print infoGainList[0][0]
+
+    if infoGainList[0][0] < dblMinGain:
+        return (None, None)
+    return (infoGainList[0][1], infoGainList[0][2])
+
+
+def check_for_common_label(listInst):
+    """Return the boolean label shared by all instances in the given list of
+    instances, or None if no such label exists.
+
+    >>> check_for_common_label([Instance([],True), Instance([],True)])
+    True
+    >>> check_for_common_label([Instance([],False), Instance([],False)])
+    False
+    >>> check_for_common_label([Instance([],True), Instance([],False)])"""
+
+    instNum = len(listInst)
+    posNum = len([inst for inst in listInst if inst.fLabel])
+    if posNum == instNum:
+        return True
+    elif posNum == 0:
+        return False
+    return None
+
+
+def majority_label(listInst):
+    """Return the boolean label with the most weight in the given list of
+    instances.
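+    Weight, not count, decides the majority: in the first example below, a
+    single True instance of weight 1.0 outweighs a False instance of weight
+    0.75.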
+ + >>> majority_label([Instance([],True,1.0),Instance([],False,0.75)]) + True + >>> listInst =[Instance([],False),Instance([],True),Instance([],False)] + >>> majority_label(listInst) + False""" + + posWeight = 0.0 + negWeight = 0.0 + for inst in listInst: + if inst.fLabel: + posWeight += inst.dblWeight + else: + negWeight += inst.dblWeight + + return True if posWeight > negWeight else False + + +class DTree(object): + + def __init__(self, fLabel=None, ixAttr=None, fDefaultLabel=None): + if fLabel is None and ixAttr is None: + raise TypeError("DTree must be given a label or an attribute," + " but received neither.") + self.fLabel = fLabel + self.ixAttr = ixAttr + self.dictChildren = {} + self.fDefaultLabel = fDefaultLabel + if self.is_node() and self.fDefaultLabel is None: + raise TypeError("Nodes require a valid fDefaultLabel") + + def is_leaf(self): + return self.fLabel is not None + + def is_node(self): + return self.ixAttr is not None + + def add(self, dtChild, v): + if not isinstance(dtChild, self.__class__): + raise TypeError("dtChild was not a DTree") + if v in self.dictChildren: + raise ValueError("Attempted to add a child with" + " an existing attribute value.") + self.dictChildren[v] = dtChild + + def convert_to_leaf(self): + if self.is_leaf(): + return + self.fLabel = self.fDefaultLabel + self.ixAttr = None + self.fDefaultLabel = None + self.dictChildren = {} + # the following methods are used in testing -- you should need + # to worry about them + + def copy(self): + if self.is_leaf(): + return DTree(fLabel=self.fLabel) + dt = DTree(ixAttr=self.ixAttr, fDefaultLabel=self.fDefaultLabel) + for ixValue, dtChild in self.dictChildren.iteritems(): + dt.add(dtChild.copy(), ixValue) + return dt + + def _append_repr(self, listRepr): + if self.is_leaf(): + listRepr.append("[%s]" % str(self.fLabel)[0]) + else: + sDefaultLabel = str(self.fDefaultLabel)[0] + listRepr.append("<%d,%s,{" % (self.ixAttr, sDefaultLabel)) + for dtChild in self.dictChildren.values(): + dtChild._append_repr(listRepr) + listRepr.append("}>") + + def __repr__(self): + listRepr = [] + self._append_repr(listRepr) + return "".join(listRepr) + + +def build_tree_rec(setIxAttr, listInst, dblMinGain, cRemainingLevels): + """Recursively build a decision tree. + + Given a set of integer attributes, a list of instances, a boolean default + label, and a floating-point valued minimum information gain, create + a decision tree leaf or node. + + If there is a common label across all instances in listInst, the function + returns a leaf node with this common label. + + If setIxAttr is empty, the function returns a leaf with the majority label + across listInst. + + If cRemainingLevels is zero, return the majority label. (If + cRemainingLevels is less than zero, then we don't want to do anything + special -- this is our mechanism for ignoring the tree depth limit). + If no separation of the instances yields an information gain greater than + dblMinGain, the function returns a leaf with the majority label across + listInst. + + Otherwise, the function finds the attribute which maximizes information + gain, splits on the attribute, and continues building the tree + recursively. + + When building tree nodes, the function specifies the majority label across + listInst as the node's default label (fDefaultLabel argument to DTree's + __init__). 
This will be useful in pruning.""" + + majorityLabel = majority_label(listInst) + if len(setIxAttr) == 0: + return DTree(fLabel=majorityLabel) + if cRemainingLevels == 0: + return DTree(fLabel=majorityLabel) + + commonLabel = check_for_common_label(listInst) + if commonLabel is not None: + return DTree(fLabel=commonLabel) + + ixChosen, dictBest = choose_split_attribute( + setIxAttr, listInst, dblMinGain) + if ixChosen is None: + return DTree(fLabel=majorityLabel) + + dt = DTree(ixAttr=ixChosen, fDefaultLabel=majorityLabel) + subsetIxAttr = set(setIxAttr) - set([ixChosen]) + # print subsetIxAttr + for value, attrList in dictBest.items(): + dtChild = build_tree_rec( + subsetIxAttr, attrList, dblMinGain, cRemainingLevels - 1) + dt.add(dtChild, value) + + return dt + + +def count_instance_attributes(listInst): + """Return the number of attributes across all instances, or None if the + instances differ in the number of attributes they contain. + + >>> listInst = [Instance([1,2,3],True), Instance([4,5,6],False)] + >>> count_instance_attributes(listInst) + 3 + >>> count_instance_attributes([Instance([1,2],True),Instance([3],False)]) + """ + countAttr = len(listInst[0].listAttrs) + for inst in listInst: + if countAttr != len(inst.listAttrs): + return None + return countAttr + + +def build_tree(listInst, dblMinGain=0.0, cMaxLevel=-1): + """Build a decision tree with the ID3 algorithm from a list of + instances.""" + cAttr = count_instance_attributes(listInst) + if cAttr is None: + raise TypeError("Instances provided have attribute lists of " + "varying lengths.") + setIxAttr = set(xrange(cAttr)) + return build_tree_rec(setIxAttr, listInst, dblMinGain, cMaxLevel) + + +def classify(dt, inst): + """Using decision tree dt, return the label for instance inst.""" + + if dt.is_leaf(): + return dt.fLabel + value = inst.listAttrs[dt.ixAttr] + if value not in dt.dictChildren: + return dt.fDefaultLabel + return classify(dt.dictChildren[value], inst) + + +class EvaluationResult(object): + + def __init__(self, listInstCorrect, listInstIncorrect, oClassifier): + self.listInstCorrect = listInstCorrect + self.listInstIncorrect = listInstIncorrect + self.oClassifier = oClassifier + + +def weight_correct_incorrect(rslt): + """Return a pair of floating-point numbers denoting the weight of + (correct, incorrect) instances in EvaluationResult rslt. 
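+    The weights within each list are summed, so one correct instance of
+    weight 0.25 and one incorrect instance of weight 0.50 give the pair
+    (0.25, 0.5):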
+
+    >>> listInstCorrect = [Instance([],True,0.25)]
+    >>> listInstIncorrect = [Instance([],False,0.50)]
+    >>> rslt = EvaluationResult(listInstCorrect, listInstIncorrect, None)
+    >>> weight_correct_incorrect(rslt)
+    (0.25, 0.5)"""
+
+    correctInst = sum([inst.dblWeight for inst in rslt.listInstCorrect])
+    incorrectInst = sum([inst.dblWeight for inst in rslt.listInstIncorrect])
+    return (correctInst, incorrectInst)
+
+
+class CrossValidationFold(object):
+
+    """Abstract base class for all cross-validation fold types."""
+
+    def build(self):
+        # abstract method
+        raise NotImplementedError
+
+    def classify(self, dt, inst):
+        # abstract method
+        raise NotImplementedError
+
+    def check_insts(self, listInst):
+        for inst in (listInst or []):
+            if inst.fLabel is None:
+                raise TypeError("missing instance label")
+        return listInst
+
+
+class TreeFold(CrossValidationFold):
+
+    def __init__(self, listInstTraining, listInstTest, listInstValidate=None):
+        super(TreeFold, self).__init__()
+        self.listInstTraining = self.check_insts(listInstTraining)
+        self.listInstTest = self.check_insts(listInstTest)
+        self.listInstValidate = self.check_insts(listInstValidate)
+        self.cMaxLevel = -1
+
+    def build(self):
+        return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel)
+
+    def classify(self, dt, inst):
+        return classify(dt, inst)
+
+
+def evaluate_classification(cvf):
+    """Given a CrossValidationFold, build a classifier and an
+    EvaluationResult that correctly partitions test instances into a list of
+    correctly and incorrectly classified instances.
+
+    Classifiers can be built using cvf.build().
+    Evaluation results are built with
+    EvaluationResult(listInstCorrect,listInstIncorrect,dt)
+    where dt is the classifier built with cvf.build()."""
+
+    dt = cvf.build()
+    listInstCorrect = []
+    listInstIncorrect = []
+    for inst in cvf.listInstTest:
+        # print cvf.classify(dt , inst) , inst
+        if cvf.classify(dt, inst) == inst.fLabel:
+            listInstCorrect.append(inst)
+        else:
+            listInstIncorrect.append(inst)
+
+    return EvaluationResult(listInstCorrect, listInstIncorrect, dt)
+
+
+def check_folds(listInst, cFold, cMinFold):
+    """Raise a ValueError if cFold is greater than the number of instances, or
+    if cFold is less than the minimum number of folds.
+
+    >>> check_folds([Instance([],True), Instance([],False)], 2, 2)
+    >>> check_folds([Instance([],True)], 2, 1)
+    Traceback (most recent call last):
+    ...
+    ValueError: Cannot have more folds than instances
+    >>> check_folds([Instance([],False)], 1, 2)
+    Traceback (most recent call last):
+    ...
+    ValueError: Need at least 2 folds."""
+
+    if cFold > len(listInst):
+        raise ValueError("Cannot have more folds than instances")
+    if cFold < cMinFold:
+        # apply the format string outside the quotes; otherwise the literal
+        # text of the expression is raised instead of the message
+        raise ValueError("Need at least %d folds." % cMinFold)
+
+    return
+
+
+def yield_cv_folds(listInst, cFold):
+    """Yield a series of TreeFolds, which represent a partition of listInst
+    into cFold folds.
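+    For example, 6 instances with cFold=3 give three test slices of two
+    instances each, with the remaining four instances used for training
+    (a quick check; the labels here are arbitrary):
+
+    >>> listInst = [Instance([i], bool(i % 2)) for i in range(6)]
+    >>> folds = list(yield_cv_folds(listInst, 3))
+    >>> [len(tf.listInstTest) for tf in folds]
+    [2, 2, 2]
+    >>> [len(tf.listInstTraining) for tf in folds]
+    [4, 4, 4]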
+
+    You may either return a list, or `yield` (http://goo.gl/gwOfM)
+    TreeFolds one at a time."""
+
+    check_folds(listInst, cFold, 2)
+
+    listInstSize = len(listInst)
+    # divide as floats so the ceiling is taken before truncating to int;
+    # with integer division the last instances would never be tested
+    cFoldSize = int(math.ceil(float(listInstSize) / cFold))
+
+# folds = []
+# for i in range(cFold):
+#     if i == cFold - 1:
+#         folds.append(listInst[i * cFoldSize : listInstSize])
+#     else:
+#         folds.append(listInst[i * cFoldSize : (i + 1) * cFoldSize])
+
+# for i in range(cFold):
+#     listInstTest = folds[i]
+#     listInstTraining = []
+#     for j in range(cFold):
+#         if i == j: continue
+#         listInstTraining += folds[j]
+#
+#     print len(listInstTest) , len(listInstTraining)
+#     yield TreeFold(listInstTraining , listInstTest)
+
+    for i in range(cFold):
+        id1 = i * cFoldSize
+        id2 = min(listInstSize, (i + 1) * cFoldSize)
+        listInstTest = listInst[id1:id2]
+        listInstTraining = listInst[:id1]
+        listInstTraining.extend(listInst[id2:])
+        yield TreeFold(listInstTraining, listInstTest)
+
+
+def cv_score(iterableFolds):
+    """Determine the fraction (by weight) of correct instances across a number
+    of cross-validation folds."""
+
+    correct = 0.0
+    incorrect = 0.0
+    for cvf in iterableFolds:
+        result = evaluate_classification(cvf)
+        correctWeight, incorrectWeight = weight_correct_incorrect(result)
+
+        # debug output, disabled so only the caller's summary is printed:
+        # for inst in result.listInstCorrect:
+        #     print inst.fLabel,
+        # for inst in result.listInstIncorrect:
+        #     print inst.fLabel,
+        # print '\n'
+        # print '-----------------------------------------------'
+
+        # print len(result.listInstCorrect) , len(result.listInstIncorrect)
+        # print '----------------------------------------------'
+        # print correctWeight , incorrectWeight
+        # print '-------------------------------------------'
+        correct += correctWeight
+        incorrect += incorrectWeight
+
+    return correct / (correct + incorrect)
+
+
+def prune_tree(dt, listInst):
+    """Recursively prune a decision tree.
+    Given a subtree to prune and a list of instances,
+    recursively prune the tree, then determine if the current node should
+    become a leaf.
+
+    The function does not return anything, and instead modifies the tree
+    in-place."""
+
+    score = 0.0
+    prunedScore = 0.0
+    if dt.is_leaf():
+        return
+
+    dictInst = separate_by_attribute(listInst, dt.ixAttr)
+    for key, child in dt.dictChildren.items():
+        if key not in dictInst:
+            continue
+        prune_tree(child, dictInst[key])
+
+    for inst in listInst:
+        if classify(dt, inst) == inst.fLabel:
+            score += inst.dblWeight
+        if dt.fDefaultLabel == inst.fLabel:
+            prunedScore += inst.dblWeight
+
+    if prunedScore >= score:
+        dt.convert_to_leaf()
+
+    return
+
+
+def build_pruned_tree(listInstTrain, listInstValidate):
+    """Build a decision tree from a list of training instances, then
+    prune the tree using a list of validation instances.
+
+    Return the pruned decision tree."""
+
+    dt = build_tree(listInstTrain)
+    prune_tree(dt, listInstValidate)
+    return dt
+
+
+class PrunedFold(TreeFold):
+
+    def __init__(self, *args, **kwargs):
+        super(PrunedFold, self).__init__(*args, **kwargs)
+        if self.listInstValidate is None:
+            raise TypeError("PrunedCrossValidationFold requires "
+                            "listInstValidate argument.")
+
+    def build(self):
+        return build_pruned_tree(self.listInstTraining, self.listInstValidate)
+
+
+def yield_cv_folds_with_validation(listInst, cFold):
+    """Yield a number cFold of PrunedFolds, which together form a partition of
+    the list of instances listInst.
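+    Each fold tests on one slice of listInst, validates (for pruning) on
+    the slice that follows -- wrapping around to the first slice for the
+    final fold -- and trains on everything else.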
+
+    You may either return a list or yield successive values."""
+
+    check_folds(listInst, cFold, 3)
+    listInstSize = len(listInst)
+    # divide as floats so the ceiling is taken before truncating to int
+    cFoldSize = int(math.ceil(float(listInstSize) / cFold))
+    # print cFold
+    for i in range(cFold):
+        id1 = i * cFoldSize
+        id2 = min(listInstSize, (i + 1) * cFoldSize)
+        listInstTest = listInst[id1:id2]
+        if id2 == listInstSize:
+            listInstValidation = listInst[0:cFoldSize]
+            listInstTraining = listInst[cFoldSize:id1]
+        else:
+            id3 = min(listInstSize, id2 + cFoldSize)
+            listInstValidation = listInst[id2:id3]
+            listInstTraining = listInst[:id1]
+            listInstTraining.extend(listInst[id3:])
+        yield PrunedFold(listInstTraining, listInstTest, listInstValidation)
+
+
+def normalize_weights(listInst):
+    """Normalize the weights of all the instances in listInst so that the sum
+    of their weights totals to 1.0.
+
+    The function modifies the weights of the instances in-place and does
+    not return anything.
+
+    >>> listInst = [Instance([],True,0.1), Instance([],False,0.3)]
+    >>> normalize_weights(listInst)
+    >>> print listInst
+    [Instance([], True, 0.25), Instance([], False, 0.75)]"""
+
+    wTotal = sum(map(lambda inst: inst.dblWeight, listInst))
+
+    for inst in listInst:
+        inst.dblWeight /= wTotal
+
+
+def init_weights(listInst):
+    """Initialize the weights of the instances in listInst so that each
+    instance has weight 1/(number of instances). This function modifies
+    the weights in place and does not return anything.
+
+    >>> listInst = [Instance([],True,0.5), Instance([],True,0.25)]
+    >>> init_weights(listInst)
+    >>> print listInst
+    [Instance([], True, 0.50), Instance([], True, 0.50)]"""
+
+    nTotal = len(listInst)
+    for inst in listInst:
+        inst.dblWeight = 1.0 / nTotal
+    return
+
+
+def classifier_error(rslt):
+    """Given an evaluation result, return the (floating-point) fraction
+    of incorrect instances by weight.
+
+    >>> listInstCorrect = [Instance([],True,0.15)]
+    >>> listInstIncorrect = [Instance([],True,0.45)]
+    >>> rslt = EvaluationResult(listInstCorrect,listInstIncorrect,None)
+    >>> classifier_error(rslt)
+    0.75"""
+
+    correctWeights = sum(
+        map(lambda inst: inst.dblWeight, rslt.listInstCorrect))
+    inCorrectWeights = sum(
+        map(lambda inst: inst.dblWeight, rslt.listInstIncorrect))
+    return 1.0 * inCorrectWeights / (inCorrectWeights + correctWeights)
+
+
+def classifier_weight(dblError):
+    """Return the classifier weight alpha from the classifier's training
+    error."""
+
+    return 0.5 * math.log((1 - dblError) / dblError)
+
+
+def update_weight_unnormalized(inst, dblClassifierWeight, fClassifiedLabel):
+    """Re-weight an instance given the classifier weight, and the label
+    assigned to the instance by the classifier. This function acts in place
+    and does not return anything."""
+
+    if inst.fLabel != fClassifiedLabel:
+        inst.dblWeight *= math.pow(math.e, dblClassifierWeight)
+    else:
+        inst.dblWeight *= math.pow(math.e, -dblClassifierWeight)
+
+
+class StumpFold(TreeFold):
+
+    def __init__(self, listInstTraining, cMaxLevel=1):
+        self.listInstTraining = listInstTraining
+        self.listInstTest = listInstTraining
+        self.cMaxLevel = cMaxLevel
+
+    def build(self):
+        return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel)
+
+
+def one_round_boost(listInst, cMaxLevel):
+    """Conduct a single round of boosting on a list of instances. Returns a
+    triple (classifier, error, classifier weight).
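+    With weighted training error e on the stump, classifier_weight returns
+    alpha = 0.5 * ln((1 - e) / e), and update_weight_unnormalized multiplies
+    a misclassified instance's weight by exp(alpha) and a correctly
+    classified instance's weight by exp(-alpha), after which the weights are
+    renormalized to sum to 1.0.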
+ + Implementation suggestion: + - build a StumpFold from the list of instances and the given + cMaxLevel (it's obnoxious that cMaxLevel has to be passed around + like this -- just pass it into Stumpfold() as the second argument + and you should be fine). + - using the StumpFold, build an EvaluationResult using + evaluate_classification + - get the error rate of the EvaluationResult using classifier_error + - obtain the classifier weight from the classifier error + - update the weight of all instances in the evaluation results + - normalize all weights + - return the EvaluationResult's oClassifier member, the classifier error, + and the classifier weight in a 3-tuple + - remember to return early if the error is zero.""" + + stump = StumpFold(listInst, cMaxLevel=cMaxLevel) + result = evaluate_classification(stump) + error = classifier_error(result) + if error == 0: + return result.oClassifier, 0, 1 + classifierWeight = classifier_weight(error) + for inst in listInst: + update_weight_unnormalized( + inst, classifierWeight, classify(result.oClassifier, inst)) + + normalize_weights(listInst) + + return (result.oClassifier, error, classifierWeight) + + +class BoostResult(object): + + def __init__(self, listDblCferWeight, listCfer): + self.listDblCferWeight = listDblCferWeight + self.listCfer = listCfer + + +def boost(listInst, cMaxRounds=50, cMaxLevel=1): + """Conduct up to cMaxRounds of boosting on training instances listInst + and return a BoostResult containing the classifiers and their weights.""" + + listCfer = [] + listDblCferWeight = [] + for iterRound in range(cMaxRounds): + (classifier, error, classifierWeight) = one_round_boost( + listInst, cMaxLevel) + listCfer.append(classifier) + listDblCferWeight.append(classifierWeight) + + return BoostResult(listDblCferWeight, listCfer) + + +def classify_boosted(br, inst): + """Given a BoostResult and an instance, return the (boolean) label + predicted for the instance by the boosted classifier.""" + + res = 0 + for i in range(len(br.listCfer)): + fClassifiedLabel = classify(br.listCfer[i], inst) - 0.5 + res += fClassifiedLabel * br.listDblCferWeight[i] + + return True if res >= 0 else False + + +class BoostedFold(TreeFold): + + def __init__(self, *args, **kwargs): + super(BoostedFold, self).__init__(*args, **kwargs) + self.cMaxLevel = 1 + self.cMaxRounds = 50 + + def build(self): + listInst = [inst.copy() for inst in self.listInstTraining] + return boost(listInst, self.cMaxRounds, self.cMaxLevel) + + def classify(self, br, inst): + return classify_boosted(br, inst) + + +def yield_boosted_folds(listInst, cFold): + """Yield a number cFold of BoostedFolds, constituting a partition of + listInst. 
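+    For example, the returned folds should simply wrap the underlying
+    train/test splits (a quick check; the labels here are arbitrary):
+
+    >>> listInst = [Instance([i], bool(i % 2)) for i in range(6)]
+    >>> folds = list(yield_boosted_folds(listInst, 3))
+    >>> [isinstance(bf, BoostedFold) for bf in folds]
+    [True, True, True]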
+
+    Implementation suggestion: Generate TreeFolds, and yield BoostedFolds
+    built from your TreeFolds."""
+    boostedFolds = []
+    folds = yield_cv_folds(listInst, cFold)
+    for fold in folds:
+        boostedFolds.append(
+            BoostedFold(fold.listInstTraining, fold.listInstTest))
+
+    return boostedFolds
+
+
+def read_csv_dataset(infile):
+    listInst = []
+    for sRow in infile:
+        listRow = map(int, sRow.strip().split())
+        inst = Instance(map(int, listRow[:-1]), bool(listRow[-1]))
+        listInst.append(inst)
+    return listInst
+
+
+def load_csv_dataset(oFile):
+    if isinstance(oFile, basestring):
+        with open(oFile) as infile:
+            return read_csv_dataset(infile)
+    # oFile is already an open file-like object, not a path
+    return read_csv_dataset(oFile)
+
+
+def main(argv):
+    import doctest
+    doctest.testmod()
+    listInst = load_csv_dataset("data.csv")
+    cFold = 10
+    iterableFolds = yield_cv_folds_with_validation(listInst, cFold)
+    #iterableFolds = yield_cv_folds(listInst,cFold)
+    #iterableFolds = yield_boosted_folds(listInst,cFold)
+    print "%.2f%% correct" % (100.0 * cv_score(iterableFolds))
+    return 0
+
+
+if __name__ == "__main__":
+    import doctest
+    doctest.testmod()
diff --git a/DecisionTreesAndBoosting/dtree1.py b/DecisionTreesAndBoosting/dtree1.py
new file mode 100755
index 0000000..5566b07
--- /dev/null
+++ b/DecisionTreesAndBoosting/dtree1.py
@@ -0,0 +1,750 @@
+#!/usr/bin/env python
+
+"""
+dtree1.py -- CS181 Assignment 1: Decision Trees
+
+Implements decision trees, decision stumps, decision tree pruning, and
+adaptive boosting.
+
+TODO: create a proper class and get rid of the code duplicated from dtree.py
+"""
+
+import math
+
+
+def log2(dbl):
+    return math.log(dbl) / math.log(2.0) if dbl > 0.0 else 0.0
+
+
+class Instance(object):
+
+    """Describes a piece of data. The features are contained in listAttrs,
+    the instance label in fLabel, and the instance weight (for use in boosting)
+    in dblWeight."""
+
+    def __init__(self, listAttrs, fLabel=None, dblWeight=1.0):
+        self.listAttrs = listAttrs
+        self.fLabel = fLabel
+        self.dblWeight = dblWeight
+
+    def copy(self):
+        return Instance(list(self.listAttrs), self.fLabel, self.dblWeight)
+
+    def __repr__(self):
+        """This function is called when you 'print' an instance."""
+        if self.dblWeight == 1.0:
+            return "Instance(%r, %r)" % (self.listAttrs, self.fLabel)
+        return ("Instance(%r, %r, %.2f)"
+                % (self.listAttrs, self.fLabel, self.dblWeight))
+
+
+def compute_entropy(dblWeightTrue, dblWeightFalse):
+    """Given the total weight of true instances and the total weight
+    of false instances in a collection,
+    return the entropy of this collection.
+    >>> compute_entropy(0.0,1000.0)
+    -0.0
+    >>> compute_entropy(0.0001, 0.0)
+    -0.0
+    >>> compute_entropy(1,1)
+    1.0"""
+
+    P = 1.0 * dblWeightTrue / (dblWeightTrue + dblWeightFalse)
+    entropy = -(P * log2(P) + (1 - P) * log2(1 - P))
+
+    return entropy
+
+
+def separate_by_attribute(listInst, ixAttr):
+    """Build a dictionary mapping attribute values to lists of instances.
+
+    >>> separate_by_attribute([Instance([5,0],True),Instance([9,0],True)], 0)
+    {9: [Instance([9, 0], True)], 5: [Instance([5, 0], True)]}"""
+
+    dictInst = {}
+    for inst in listInst:
+        # print inst , ixAttr
+        featureValue = inst.listAttrs[ixAttr]
+        if featureValue not in dictInst:
+            dictInst[featureValue] = []
+        dictInst[featureValue].append(inst)
+
+    return dictInst
+
+
+def compute_entropy_of_split(dictInst):
+    """Compute the average entropy of a mapping of attribute values to lists
+    of instances.
+    The average should be weighted by the sum of the weight in each list of
+    instances.
+ >>> listInst0 = [Instance([],True,0.5), Instance([],False,0.5)] + >>> listInst1 = [Instance([],False,3.0), Instance([],True,0.0)] + >>> dictInst = {0: listInst0, 1: listInst1} + >>> compute_entropy_of_split(dictInst) + 0.25""" + + wTotal = 0 + weightEntropy = 0 + for values in dictInst.values(): + wt = sum( + map(lambda inst: inst.dblWeight if inst.fLabel else 0, values)) + wf = sum( + map(lambda inst: inst.dblWeight if not inst.fLabel else 0, values)) + w = wt + wf + weightEntropy += w * compute_entropy(wt, wf) + wTotal += w + # print entropy , instNum , posInstNum , negInstNum + + return 1.0 * weightEntropy / wTotal + + +def compute_list_entropy(listInst): + return compute_entropy_of_split({None: listInst}) + + +def choose_split_attribute(iterableIxAttr, listInst, dblMinGain=0.0): + """Given an iterator over attributes, choose the attribute which + maximimizes the information gain of separating a collection of + instances based on that attribute. + Returns a tuple of (the integer best attribute, a dictionary of the + separated instances). + If the best information gain is less than dblMinGain, then return the + pair (None,None). + >>> listInst = [Instance([0,0],False), Instance([0,1],True)] + >>> choose_split_attribute([0,1], listInst) + (1, {0: [Instance([0, 0], False)], 1: [Instance([0, 1], True)]})""" + + entropy = compute_list_entropy(listInst) + + infoGainList = [] + for ixAttr in iterableIxAttr: + dictInst = separate_by_attribute(listInst, ixAttr) + expEntropy = compute_entropy_of_split(dictInst) + infoGain = entropy - expEntropy + infoGainList.append((infoGain, ixAttr, dictInst)) + + infoGainList = sorted(infoGainList, reverse=1) + + # print infoGainList[0][0] + + if infoGainList[0][0] < dblMinGain: + return (None, None) + return (infoGainList[0][1], infoGainList[0][2]) + + +def check_for_common_label(listInst): + """Return the boolean label shared by all instances in the given list of + instances, or None if no such label exists + + >>> check_for_common_label([Instance([],True), Instance([],True)]) + True + >>> check_for_common_label([Instance([],False), Instance([],False)]) + False + >>> check_for_common_label([Instance([],True), Instance([],False)])""" + + instNum = len(listInst) + posNum = len([inst for inst in listInst if inst.fLabel]) + if posNum == instNum: + return True + elif posNum == 0: + return False + return None + + +def majority_label(listInst): + """Return the boolean label with the most weight in the given list of + instances. 
+ + >>> majority_label([Instance([],True,1.0),Instance([],False,0.75)]) + True + >>> listInst =[Instance([],False),Instance([],True),Instance([],False)] + >>> majority_label(listInst) + False""" + + posWeight = 0.0 + negWeight = 0.0 + for inst in listInst: + if inst.fLabel: + posWeight += inst.dblWeight + else: + negWeight += inst.dblWeight + + return True if posWeight >= negWeight else False + + +class DTree(object): + + def __init__(self, fLabel=None, ixAttr=None, fDefaultLabel=None): + if fLabel is None and ixAttr is None: + raise TypeError("DTree must be given a label or an attribute," + " but received neither.") + self.fLabel = fLabel + self.ixAttr = ixAttr + self.dictChildren = {} + self.fDefaultLabel = fDefaultLabel + if self.is_node() and self.fDefaultLabel is None: + raise TypeError("Nodes require a valid fDefaultLabel") + + def is_leaf(self): + return self.fLabel is not None + + def is_node(self): + return self.ixAttr is not None + + def add(self, dtChild, v): + if not isinstance(dtChild, self.__class__): + raise TypeError("dtChild was not a DTree") + if v in self.dictChildren: + raise ValueError("Attempted to add a child with" + " an existing attribute value.") + self.dictChildren[v] = dtChild + + def convert_to_leaf(self): + if self.is_leaf(): + return + self.fLabel = self.fDefaultLabel + self.ixAttr = None + self.fDefaultLabel = None + self.dictChildren = {} + # the following methods are used in testing -- you should need + # to worry about them + + def copy(self): + if self.is_leaf(): + return DTree(fLabel=self.fLabel) + dt = DTree(ixAttr=self.ixAttr, fDefaultLabel=self.fDefaultLabel) + for ixValue, dtChild in self.dictChildren.iteritems(): + dt.add(dtChild.copy(), ixValue) + return dt + + def _append_repr(self, listRepr): + if self.is_leaf(): + listRepr.append("[%s]" % str(self.fLabel)[0]) + else: + sDefaultLabel = str(self.fDefaultLabel)[0] + listRepr.append("<%d,%s,{" % (self.ixAttr, sDefaultLabel)) + for dtChild in self.dictChildren.values(): + dtChild._append_repr(listRepr) + listRepr.append("}>") + + def __repr__(self): + listRepr = [] + self._append_repr(listRepr) + return "".join(listRepr) + + +def build_tree_rec(setIxAttr, listInst, dblMinGain, cRemainingLevels): + """Recursively build a decision tree. + + Given a set of integer attributes, a list of instances, a boolean default + label, and a floating-point valued minimum information gain, create + a decision tree leaf or node. + + If there is a common label across all instances in listInst, the function + returns a leaf node with this common label. + + If setIxAttr is empty, the function returns a leaf with the majority label + across listInst. + + If cRemainingLevels is zero, return the majority label. (If + cRemainingLevels is less than zero, then we don't want to do anything + special -- this is our mechanism for ignoring the tree depth limit). + If no separation of the instances yields an information gain greater than + dblMinGain, the function returns a leaf with the majority label across + listInst. + + Otherwise, the function finds the attribute which maximizes information + gain, splits on the attribute, and continues building the tree + recursively. + + When building tree nodes, the function specifies the majority label across + listInst as the node's default label (fDefaultLabel argument to DTree's + __init__). 
This will be useful in pruning.""" + + majorityLabel = majority_label(listInst) + if len(setIxAttr) == 0: + return DTree(fLabel=majorityLabel) + if cRemainingLevels == 0: + return DTree(fLabel=majorityLabel) + + commonLabel = check_for_common_label(listInst) + if commonLabel is not None: + return DTree(fLabel=commonLabel) + + ixChosen, dictBest = choose_split_attribute( + setIxAttr, listInst, dblMinGain) + if ixChosen is None: + return DTree(fLabel=majorityLabel) + + dt = DTree(ixAttr=ixChosen, fDefaultLabel=majorityLabel) + subsetIxAttr = set(setIxAttr) - set([ixChosen]) + # print subsetIxAttr + for value, attrList in dictBest.items(): + dtChild = build_tree_rec( + subsetIxAttr, attrList, dblMinGain, cRemainingLevels - 1) + dt.add(dtChild, value) + + return dt + + +def count_instance_attributes(listInst): + """Return the number of attributes across all instances, or None if the + instances differ in the number of attributes they contain. + + >>> listInst = [Instance([1,2,3],True), Instance([4,5,6],False)] + >>> count_instance_attributes(listInst) + 3 + >>> count_instance_attributes([Instance([1,2],True),Instance([3],False)]) + """ + countAttr = len(listInst[0].listAttrs) + for inst in listInst: + if countAttr != len(inst.listAttrs): + return None + return countAttr + + +def build_tree(listInst, dblMinGain=0.0, cMaxLevel=-1): + """Build a decision tree with the ID3 algorithm from a list of + instances.""" + cAttr = count_instance_attributes(listInst) + if cAttr is None: + raise TypeError("Instances provided have attribute lists of " + "varying lengths.") + setIxAttr = set(xrange(cAttr)) + return build_tree_rec(setIxAttr, listInst, dblMinGain, cMaxLevel) + + +def classify(dt, inst): + """Using decision tree dt, return the label for instance inst.""" + + if dt.is_leaf(): + return dt.fLabel + value = inst.listAttrs[dt.ixAttr] + if value not in dt.dictChildren: + return dt.fDefaultLabel + return classify(dt.dictChildren[value], inst) + + +class EvaluationResult(object): + + def __init__(self, listInstCorrect, listInstIncorrect, oClassifier): + self.listInstCorrect = listInstCorrect + self.listInstIncorrect = listInstIncorrect + self.oClassifier = oClassifier + + +def weight_correct_incorrect(rslt): + """Return a pair of floating-point numbers denoting the weight of + (correct, incorrect) instances in EvaluationResult rslt. 
+ + >>> listInstCorrect = [Instance([],True,0.25)] + >>> listInstIncorrect = [Instance([],False,0.50)] + >>> rslt = EvaluationResult(listInstCorrect, listInstIncorrect, None) + >>> weight_correct_incorrect(rslt) + (0.25, 0.5)""" + + correctInst = sum([inst.dblWeight for inst in rslt.listInstCorrect]) + incorrectInst = sum([inst.dblWeight for inst in rslt.listInstIncorrect]) + return (correctInst, incorrectInst) + + +class CrossValidationFold(object): + + """Abstract base class for all cross validaiton fold types.""" + + def build(self): + # abstract method + raise NotImplementedError + + def classify(self, dt, inst): + # abstract method + raise NotImplementedError + + def check_insts(self, listInst): + for inst in (listInst or []): + if inst.fLabel is None: + raise TypeError("missing instance label") + return listInst + + +class TreeFold(CrossValidationFold): + + def __init__(self, listInstTraining, listInstTest, listInstValidate=None): + super(TreeFold, self).__init__() + self.listInstTraining = self.check_insts(listInstTraining) + self.listInstTest = self.check_insts(listInstTest) + self.listInstValidate = self.check_insts(listInstValidate) + self.cMaxLevel = -1 + + def build(self): + return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel) + + def classify(self, dt, inst): + return classify(dt, inst) + + +def evaluate_classification(cvf): + """Given a CrossValidationFold, build a classifier and build an + EvaluationResult that correctly partitions test instances into a list of + correctly and incorrectly classified instances. + + Classifiers can be built using cvf.build(). + Evaluation results are built with + EvaluationResult(listInstCorrect,listInstIncorrect,dt) + where dt is the classifier built with cvf.build().""" + + dt = cvf.build() + listInstCorrect = [] + listInstIncorrect = [] + for inst in cvf.listInstTest: + # print cvf.classify(dt , inst) , inst + if cvf.classify(dt, inst) == inst.fLabel: + listInstCorrect.append(inst) + else: + listInstIncorrect.append(inst) + + return EvaluationResult(listInstCorrect, listInstIncorrect, dt) + + +def check_folds(listInst, cFold, cMinFold): + """Raise a ValueError if cFold is greater than the number of instances, or + if cFold is less than the minimum number of folds. + + >>> check_folds([Instance([],True), Instance([],False)], 1, 2) + >>> check_folds([Instance([],True)], 2, 1) + Traceback (most recent call last): + ... + ValueError: Cannot have more folds than instances + >>> check_folds([Instance([],False)], 1, 2) + Traceback (most recent call last): + ... + ValueError: Need at least 2 folds.""" + + if cFold > len(listInst): + raise ValueError("Cannot have more folds than instances") + if cFold < cMinFold: + raise ValueError("'Need at least %d folds' % (cMinFold)") + + return + + +def yield_cv_folds(listInst, cFold): + """Yield a series of TreeFolds, which represent a partition of listInst + into cFold folds. 
+
+    You may either return a list, or `yield` (http://goo.gl/gwOfM)
+    TreeFolds one at a time."""
+
+    check_folds(listInst, cFold, 2)
+
+    listInstSize = len(listInst)
+    cFoldSize = int(math.ceil(listInstSize / float(cFold)))
+
+    for i in range(cFold):
+        id1 = i * cFoldSize
+        id2 = min(listInstSize, (i + 1) * cFoldSize)
+        listInstTest = listInst[id1:id2]
+        listInstTraining = listInst[:id1]
+        listInstTraining.extend(listInst[id2:])
+        yield TreeFold(listInstTraining, listInstTest)
+
+
+def cv_score(iterableFolds):
+    """Determine the fraction (by weight) of correct instances across a number
+    of cross-validation folds."""
+
+    correct = 0.0
+    incorrect = 0.0
+    for cvf in iterableFolds:
+        result = evaluate_classification(cvf)
+        correctWeight, incorrectWeight = weight_correct_incorrect(result)
+        correct += correctWeight
+        incorrect += incorrectWeight
+
+    return correct / (correct + incorrect)
+
+
+def prune_tree(dt, listInst):
+    """Recursively prune a decision tree.
+
+    Given a subtree to prune and a list of instances, recursively prune the
+    tree, then determine if the current node should become a leaf.
+
+    The function does not return anything, and instead modifies the tree
+    in-place."""
+
+    if dt.is_leaf():
+        return
+
+    dictInst = separate_by_attribute(listInst, dt.ixAttr)
+    for key, child in dt.dictChildren.items():
+        if key not in dictInst:
+            continue
+        prune_tree(child, dictInst[key])
+
+    score = 0.0
+    prunedScore = 0.0
+    for inst in listInst:
+        if classify(dt, inst) == inst.fLabel:
+            score += inst.dblWeight
+        if dt.fDefaultLabel == inst.fLabel:
+            prunedScore += inst.dblWeight
+
+    if prunedScore >= score:
+        dt.convert_to_leaf()
+
+    return
+
+
+def build_pruned_tree(listInstTrain, listInstValidate):
+    """Build a pruned decision tree from a list of training instances, then
+    prune the tree using a list of validation instances.
+
+    Return the pruned decision tree."""
+
+    dt = build_tree(listInstTrain)
+    prune_tree(dt, listInstValidate)
+    return dt
+
+
+class PrunedFold(TreeFold):
+
+    def __init__(self, *args, **kwargs):
+        super(PrunedFold, self).__init__(*args, **kwargs)
+        if self.listInstValidate is None:
+            raise TypeError("PrunedCrossValidationFold requires "
+                            "listInstValidate argument.")
+
+    def build(self):
+        return build_pruned_tree(self.listInstTraining, self.listInstValidate)
+
+
+def yield_cv_folds_with_validation(listInst, cFold):
+    """Yield a number cFold of PrunedFolds, which together form a partition of
+    the list of instances listInst.
+
+    You may either return a list or yield successive values."""
+
+    listInstSize = len(listInst)
+    cFoldSize = int(math.ceil(listInstSize / float(cFold)))
+
+    folds = []
+    for i in range(cFold):
+        id1 = i * cFoldSize
+        id2 = min(listInstSize, (i + 1) * cFoldSize)
+        folds.append(listInst[id1:id2])
+
+    for i in range(cFold):
+        # the fold after the test fold (wrapping around) serves as the
+        # validation set; every remaining fold is training data
+        listInstTest = folds[i]
+        listInstValidation = folds[(i + 1) % cFold]
+        listInstTraining = []
+        for j in range(cFold):
+            if j == i or j == (i + 1) % cFold:
+                continue
+            listInstTraining += folds[j]
+        yield PrunedFold(listInstTraining, listInstTest, listInstValidation)
+
+
+def normalize_weights(listInst):
+    """Normalize the weights of all the instances in listInst so that the sum
+    of their weights totals to 1.0.
+
+    The function modifies the weights of the instances in-place and does
+    not return anything.
+
+    >>> listInst = [Instance([],True,0.1), Instance([],False,0.3)]
+    >>> normalize_weights(listInst)
+    >>> print listInst
+    [Instance([], True, 0.25), Instance([], False, 0.75)]"""
+
+    wTotal = sum(map(lambda inst: inst.dblWeight, listInst))
+
+    for inst in listInst:
+        inst.dblWeight /= wTotal
+
+
+def init_weights(listInst):
+    """Initialize the weights of the instances in listInst so that each
+    instance has weight 1/(number of instances). This function modifies
+    the weights in place and does not return anything.
+
+    >>> listInst = [Instance([],True,0.5), Instance([],True,0.25)]
+    >>> init_weights(listInst)
+    >>> print listInst
+    [Instance([], True, 0.50), Instance([], True, 0.50)]"""
+
+    nTotal = len(listInst)
+    for inst in listInst:
+        inst.dblWeight = 1.0 / nTotal
+    return
+
+
+def classifier_error(rslt):
+    """Given an evaluation result, return the (floating-point) fraction
+    of incorrect instances by weight.
+
+    >>> listInstCorrect = [Instance([],True,0.15)]
+    >>> listInstIncorrect = [Instance([],True,0.45)]
+    >>> rslt = EvaluationResult(listInstCorrect,listInstIncorrect,None)
+    >>> classifier_error(rslt)
+    0.75"""
+    raise NotImplementedError
+
+
+def classifier_weight(dblError):
+    """Return the classifier weight alpha from the classifier's training
+    error."""
+    raise NotImplementedError
+
+
+def update_weight_unnormalized(inst, dblClassifierWeight, fClassifiedLabel):
+    """Re-weight an instance given the classifier weight, and the label
+    assigned to the instance by the classifier. This function acts in place
+    and does not return anything. (See the illustrative sketches below.)"""
+    raise NotImplementedError
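+
+
+# Editor's sketches: one possible implementation of the three skeletons
+# above, inferred from the expectations in testdtree.py (error is the
+# weighted fraction of incorrect instances; alpha = 0.5*ln((1 - e)/e);
+# weights scale by exp(-alpha) when correct and exp(alpha) when incorrect).
+# Illustrative only, not the official solution; the underscore-prefixed
+# names are additions that nothing else in this file calls.
+def _sketch_classifier_error(rslt):
+    # weighted error = incorrect weight / total weight
+    dblCorrect, dblIncorrect = weight_correct_incorrect(rslt)
+    return dblIncorrect / (dblCorrect + dblIncorrect)
+
+
+def _sketch_classifier_weight(dblError):
+    # inverts error == 1.0/(exp(2.0*alpha) + 1.0), the relation checked by
+    # test_classifier_weight
+    return 0.5 * math.log((1.0 - dblError) / dblError)
+
+
+def _sketch_update_weight_unnormalized(inst, dblClassifierWeight,
+                                       fClassifiedLabel):
+    # AdaBoost re-weighting: shrink the weight of a correctly classified
+    # instance, grow a misclassified one; the caller renormalizes afterwards
+    if fClassifiedLabel == inst.fLabel:
+        inst.dblWeight *= math.exp(-dblClassifierWeight)
+    else:
+        inst.dblWeight *= math.exp(dblClassifierWeight)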
+
+
+class StumpFold(TreeFold):
+
+    def __init__(self, listInstTraining, cMaxLevel=1):
+        self.listInstTraining = listInstTraining
+        self.listInstTest = listInstTraining
+        self.cMaxLevel = cMaxLevel
+
+    def build(self):
+        return build_tree(self.listInstTraining, cMaxLevel=self.cMaxLevel)
+
+
+def one_round_boost(listInst, cMaxLevel):
+    """Conduct a single round of boosting on a list of instances. Returns a
+    triple (classifier, error, classifier weight).
+
+    Implementation suggestion:
+    - build a StumpFold from the list of instances and the given
+      cMaxLevel (it's obnoxious that cMaxLevel has to be passed around
+      like this -- just pass it into StumpFold() as the second argument
+      and you should be fine).
+    - using the StumpFold, build an EvaluationResult using
+      evaluate_classification
+    - get the error rate of the EvaluationResult using classifier_error
+    - obtain the classifier weight from the classifier error
+    - update the weight of all instances in the evaluation results
+    - normalize all weights
+    - return the EvaluationResult's oClassifier member, the classifier error,
+      and the classifier weight in a 3-tuple
+    - remember to return early if the error is zero.
+
+    (An illustrative sketch following these steps appears after
+    yield_boosted_folds below.)"""
+    raise NotImplementedError
+
+
+class BoostResult(object):
+
+    def __init__(self, listDblCferWeight, listCfer):
+        self.listDblCferWeight = listDblCferWeight
+        self.listCfer = listCfer
+
+
+def boost(listInst, cMaxRounds=50, cMaxLevel=1):
+    """Conduct up to cMaxRounds of boosting on training instances listInst
+    and return a BoostResult containing the classifiers and their weights."""
+    raise NotImplementedError
+
+
+def classify_boosted(br, inst):
+    """Given a BoostResult and an instance, return the (boolean) label
+    predicted for the instance by the boosted classifier."""
+    raise NotImplementedError
+
+
+class BoostedFold(TreeFold):
+
+    def __init__(self, *args, **kwargs):
+        super(BoostedFold, self).__init__(*args, **kwargs)
+        self.cMaxLevel = 1
+        self.cMaxRounds = 50
+
+    def build(self):
+        listInst = [inst.copy() for inst in self.listInstTraining]
+        return boost(listInst, self.cMaxRounds, self.cMaxLevel)
+
+    def classify(self, br, inst):
+        return classify_boosted(br, inst)
+
+
+def yield_boosted_folds(listInst, cFold):
+    """Yield a number cFold of BoostedFolds, constituting a partition of
+    listInst.
+
+    Implementation suggestion: Generate TreeFolds, and yield BoostedFolds
+    built from your TreeFolds."""
+    raise NotImplementedError
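+
+
+# Editor's sketches for the boosting skeletons above. They follow the
+# implementation suggestions in the docstrings and the conventions assumed
+# by testdtree.py, but they are illustrative only, not the official
+# solution; the underscore-prefixed names (together with the _sketch_*
+# helpers defined earlier) are additions that nothing else in this file
+# calls.
+def _sketch_one_round_boost(listInst, cMaxLevel):
+    # build and evaluate a weak learner on the current instance weights
+    sf = StumpFold(listInst, cMaxLevel)
+    rslt = evaluate_classification(sf)
+    dblError = _sketch_classifier_error(rslt)
+    if dblError == 0.0:
+        # a perfect round: stop boosting early (the weight here is arbitrary)
+        return rslt.oClassifier, 0.0, 1.0
+    dblCferWeight = _sketch_classifier_weight(dblError)
+    # instances in listInstCorrect were classified as their own label;
+    # instances in listInstIncorrect were classified as the opposite label
+    for inst in rslt.listInstCorrect:
+        _sketch_update_weight_unnormalized(inst, dblCferWeight, inst.fLabel)
+    for inst in rslt.listInstIncorrect:
+        _sketch_update_weight_unnormalized(inst, dblCferWeight,
+                                           not inst.fLabel)
+    normalize_weights(listInst)
+    return rslt.oClassifier, dblError, dblCferWeight
+
+
+def _sketch_boost(listInst, cMaxRounds=50, cMaxLevel=1):
+    init_weights(listInst)
+    listDblCferWeight = []
+    listCfer = []
+    for _ in xrange(cMaxRounds):
+        dt, dblError, dblWeight = _sketch_one_round_boost(listInst, cMaxLevel)
+        listDblCferWeight.append(dblWeight)
+        listCfer.append(dt)
+        if dblError == 0.0:
+            break
+    return BoostResult(listDblCferWeight, listCfer)
+
+
+def _sketch_classify_boosted(br, inst):
+    # weighted vote with True counted as +1 and False as -1; ties go to True
+    dblVote = 0.0
+    for dblWeight, cfer in zip(br.listDblCferWeight, br.listCfer):
+        dblVote += dblWeight * (1.0 if classify(cfer, inst) else -1.0)
+    return dblVote >= 0.0
+
+
+def _sketch_yield_boosted_folds(listInst, cFold):
+    # reuse the plain cross-validation partition, rewrapped as BoostedFolds
+    for tf in yield_cv_folds(listInst, cFold):
+        yield BoostedFold(tf.listInstTraining, tf.listInstTest)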
+
+
+def read_csv_dataset(infile):
+    listInst = []
+    for sRow in infile:
+        listRow = map(int, sRow.strip().split())
+        inst = Instance(map(int, listRow[:-1]), bool(listRow[-1]))
+        listInst.append(inst)
+    return listInst
+
+
+def load_csv_dataset(oFile):
+    if isinstance(oFile, basestring):
+        with open(oFile) as infile:
+            return read_csv_dataset(infile)
+    return read_csv_dataset(oFile)
+
+
+def main(argv):
+    import doctest
+    doctest.testmod()
+    listInst = load_csv_dataset("data.csv")
+    cFold = 10
+    iterableFolds = yield_cv_folds_with_validation(listInst, cFold)
+    #iterableFolds = yield_cv_folds(listInst,cFold)
+    #iterableFolds = yield_boosted_folds(listInst,cFold)
+    print "%.2f%% correct" % (100.0 * cv_score(iterableFolds))
+    return 0
+
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(main(sys.argv))
diff --git a/DecisionTrees &Boosting/dttasks.py b/DecisionTreesAndBoosting/dttasks.py similarity index 63% rename from DecisionTrees &Boosting/dttasks.py rename to DecisionTreesAndBoosting/dttasks.py index d93afd3..d4a031b 100755 --- a/DecisionTrees &Boosting/dttasks.py +++ b/DecisionTreesAndBoosting/dttasks.py @@ -9,267 +9,335 @@ from tfutils import tftask import dtree + def serialize_tree(dtRoot): listSrcDestValue = [] - cNodes = 0 - def node_name(dt,ix): + + def node_name(dt, ix): if dt.is_node(): - return "Node %d (Split on %d)" % (ix,dt.ixAttr) - return "Leaf %d (%s)" % (ix,str(dt.fLabel)[0]) - def down(dt,ixParent): + return "Node %d (Split on %d)" % (ix, dt.ixAttr) + return "Leaf %d (%s)" % (ix, str(dt.fLabel)[0]) + + def down(dt, ixParent): if dt.is_leaf(): return - sParentName = node_name(dt,ixParent) - for cValue,dtChild in dt.dictChildren.iteritems(): + sParentName = node_name(dt, ixParent) + for cValue, dtChild in dt.dictChildren.iteritems(): ixNode = len(listSrcDestValue) + 1 - tplEdge = (sParentName,node_name(dtChild,ixNode),cValue) + tplEdge = (sParentName, node_name(dtChild, ixNode), cValue) listSrcDestValue.append(tplEdge) down(dtChild, ixNode) - down(dtRoot,0) + down(dtRoot, 0) listColor = ["#FF0000", "#00FF00", "#0000FF", "#00FFFF", "#FF00FF", "#FFFF00", "#000000", "#FF8800", "#6600DD", "#000055"] listEdge = [] cMinValue = min([tpl[2] for tpl in listSrcDestValue]) - for src,dest,cValue in listSrcDestValue: + for src, dest, cValue in listSrcDestValue: sColor = listColor[(cValue - cMinValue) % len(listColor)] - listEdge.append((src,dest,{"color":sColor})) + listEdge.append((src, dest, {"color": sColor})) return listEdge + def datadir(sPath): - return path.join(path.dirname(__file__),sPath) + return path.join(path.dirname(__file__), sPath) + def get_clean_insts(): return dtree.load_csv_dataset(datadir("data.csv")) + def get_noisy_insts(): return dtree.load_csv_dataset(datadir("noisy.dat")) + class ExampleLogPlotTask(tftask.ChartTask): + def task(self): listP = [] listData = [] - for i in map(float,xrange(0,101)): - listP.append(i/100.0) + for i in map(float, xrange(0, 101)): + listP.append(i / 100.0) dblEntropy = dtree.compute_entropy(i, 100.0 - i) listData.append(dblEntropy) - return {"chart": {"defaultSeriesType":"line"}, + return {"chart": {"defaultSeriesType": "line"}, "title": {"text": "Entropy"}, - "xAxis": {"title":{"text":"p"}}, - "yAxis": {"title": {"text":"entropy"}, "min": 0, "max": 1.1}, - "series": [{"name":"Entropy", "data": zip(listP,listData)}]} + "xAxis": {"title": {"text": "p"}}, + "yAxis": {"title": {"text": "entropy"}, "min": 0, "max": 1.1}, + "series": [{"name": "Entropy", "data": zip(listP, listData)}]} + def get_name(self): return "Plot Entropy Curve" + def get_description(self): return "Generate a curve of entropy as a function of probability." + def get_priority(self): return -1 + class BcwTreeTask(tftask.GraphTask): + def task(self): listInst = get_clean_insts() - f = open('view.txt' , 'w+') + f = open('view.txt', 'w+') for inst in listInst: - f.write(str(inst) + '\n') + f.write(str(inst) + '\n') f.close() dt = dtree.build_tree(listInst) return serialize_tree(dt) + def get_name(self): return "Build BCW Tree" + def get_description(self): return "Build a decision tree for clean (non-noisy) BCW data." + def get_priority(self): return 0 + class BcwTrainAccuracy(tftask.ChartTask): + def task(self): listInstClean = get_clean_insts() listInstNoisy = get_noisy_insts() listData = [] - listNames = ["Clean","Noisy"] - for listInst,sName in zip([listInstClean,listInstNoisy], - listNames): - - dt = dtree.build_tree(listInst) - tf = dtree.TreeFold(listInst,listInst) + listNames = ["Clean", "Noisy"] + for listInst, sName in zip([listInstClean, listInstNoisy], + listNames): + + # dt = dtree.build_tree(listInst) + tf = dtree.TreeFold(listInst, listInst) rslt = dtree.evaluate_classification(tf) - dblCorrect,dblIncorrect = dtree.weight_correct_incorrect(rslt) - dblAccuracy = dblCorrect/(dblCorrect + dblIncorrect) + dblCorrect, dblIncorrect = dtree.weight_correct_incorrect(rslt) + dblAccuracy = dblCorrect / (dblCorrect + dblIncorrect) listData.append(dblAccuracy) - return {"chart": {"defaultSeriesType":"column"}, - "title": {"text": "Clean vs. Noisy Training Set Accuracy"}, - "xAxis": {"categories": listNames}, - "yAxis": {"title": {"text":"Accuracy"}, "min":0.0, "max":1.0}, - "series": [{"name": "Training Set Accuracy", - "data": listData}]} + return { + "chart": + { + "defaultSeriesType": "column"}, + "title": + { + "text": "Clean vs. 
Noisy Training Set Accuracy" + }, + "xAxis": { + "categories": listNames}, + "yAxis": { + "title": { + "text": "Accuracy" + }, + "min": 0.0, "max": 1.0 + }, + "series": [ + { + "name": "Training Set Accuracy", + "data": listData + } + ] + } + def get_name(self): return "Measure Cross-Validated ID3 Training Set Accuracy" + def get_description(self): return ("Build an unpruned decision tree for both the clean and noisy " "BCW data sets and measure the tree's training set accuracy. " "No cross-validation is performed.") + def get_priority(self): - return 0.5 + return 0.5 + class BcwCrossValidateTask(tftask.ChartTask): + def get_name(self): return "Measure Cross-Validated Performance" + def get_description(self): return ("Build decision trees for clean and noisy BCW data and " "evaluate their performance through 10-fold cross validation.") + def get_priority(self): return 1 - def build_depth_yield(self,iDepth): - def yield_cv_folds(listInst,cFold): - for cvf in dtree.yield_cv_folds(listInst,cFold): + + def build_depth_yield(self, iDepth): + def yield_cv_folds(listInst, cFold): + for cvf in dtree.yield_cv_folds(listInst, cFold): cvf.cMaxLevel = iDepth yield cvf return yield_cv_folds + def task(self): listInstClean = dtree.load_csv_dataset(datadir("data.csv")) listInstNoisy = dtree.load_csv_dataset(datadir("noisy.dat")) cFold = 10 listSeries = [] - for sLbl,fxn in [("Unpruned", dtree.yield_cv_folds), - ("Pruned", dtree.yield_cv_folds_with_validation), - ("Boosted", dtree.yield_boosted_folds), - ("Stumps", self.build_depth_yield(1)), - ("Depth-2", self.build_depth_yield(2))]: + for sLbl, fxn in [("Unpruned", dtree.yield_cv_folds), + ("Pruned", dtree.yield_cv_folds_with_validation), + ("Boosted", dtree.yield_boosted_folds), + ("Stumps", self.build_depth_yield(1)), + ("Depth-2", self.build_depth_yield(2))]: try: - fxnScore = lambda listInst: dtree.cv_score(fxn(listInst,cFold)) - listData = [fxnScore(listInstClean),fxnScore(listInstNoisy)] + fxnScore = lambda listInst: dtree.cv_score( + fxn(listInst, cFold)) + listData = [fxnScore(listInstClean), fxnScore(listInstNoisy)] dictSeries = {"name": sLbl, "data": listData} except NotImplementedError: # we can forget about un-implemented functionality - dictSeries = {"name": sLbl + " (not implemented)", "data":[]} + dictSeries = {"name": sLbl + " (not implemented)", "data": []} listSeries.append(dictSeries) - - return {"chart": {"defaultSeriesType":"column"}, + + return {"chart": {"defaultSeriesType": "column"}, "title": {"text": "Clean vs. 
Noisy Classification"}, "xAxis": {"categories": ["Clean", "Noisy"]}, "yAxis": {"title": {"text": "Fraction Correct"}, - "min":0.0, "max":1.0}, + "min": 0.0, "max": 1.0}, "series": listSeries} + class BoostingCoefficients(tftask.ChartTask): + def get_name(self): return "Plot Boosting Classifier Weights" + def get_description(self): return ("Run boosting using decision stumps on clean BCW data, then " "plot the weights of the resulting classifiers.") + def get_priority(self): return 4 + def task(self): listInst = dtree.load_csv_dataset(datadir("data.csv")) br = dtree.boost(listInst) - return {"chart": {"defaultSeriesType":"line"}, + return {"chart": {"defaultSeriesType": "line"}, "title": {"text": "Boosting Classifier Weights"}, "xAxis": {"title": {"text": "Classifier Number"}}, "series": [{"name": "Classifier Weights", "data": br.listDblCferWeight}]} + class BcwPrunedDecisionTree(tftask.GraphTask): + def get_name(self): return "Prune BCW Decision Tree" + def get_description(self): return ("Build a decision tree for clean BCW data, " "then prune it using a validation set.") + def get_priority(self): return 2 + def task(self): listInst = dtree.load_csv_dataset(datadir("data.csv")) dt = dtree.build_tree(listInst[:-10]) - dtree.prune_tree(dt,listInst[-10:]) + dtree.prune_tree(dt, listInst[-10:]) return serialize_tree(dt) + class BcwDecisionStump(tftask.GraphTask): + def get_name(self): return "Build Decision Stump" + def get_description(self): return ("Build a decision stump (depth 1 decision tree) for clean " "BCW data.") + def get_priority(self): return 3 + def task(self): listInst = dtree.load_csv_dataset(datadir("data.csv")) dt = dtree.build_tree(listInst, cMaxLevel=1) return serialize_tree(dt) + class BcwCompareBoostingParameters(tftask.ChartTask): + def get_name(self): return "Compare Boosting Parameters" + def get_description(self): return ("Evaluate the performance of boosting for various numbers " "of rounds, and with different weak learners.") + def get_priority(self): return 3.5 + def build_fold_generator(self, cMaxLevel, cMaxRounds): - def yield_folds(listInst,cFold): - for cvf in dtree.yield_boosted_folds(listInst,cFold): + def yield_folds(listInst, cFold): + for cvf in dtree.yield_boosted_folds(listInst, cFold): cvf.cMaxLevel = cMaxLevel cvf.cMaxRounds = cMaxRounds yield cvf return yield_folds + def task(self): listInstClean = get_clean_insts() listInstNoisy = get_noisy_insts() listSeries = [] cFold = 10 - for sName,cMaxLevel,cMaxRounds in [("Depth 1, 10 Rounds", 1, 10), - ("Depth 2, 10 Rounds", 2, 10), - ("Depth 1, 30 Rounds", 1, 30), - ("Depth 2, 30 Rounds", 2, 30)]: - fxnGen = self.build_fold_generator(cMaxLevel,cMaxRounds) - fxnScore = lambda listInst: dtree.cv_score(fxnGen(listInst,cFold)) - listData = [fxnScore(listInstClean),fxnScore(listInstNoisy)] - listSeries.append({"name":sName, "data": listData}) - + for sName, cMaxLevel, cMaxRounds in [("Depth 1, 10 Rounds", 1, 10), + ("Depth 2, 10 Rounds", 2, 10), + ("Depth 1, 30 Rounds", 1, 30), + ("Depth 2, 30 Rounds", 2, 30)]: + fxnGen = self.build_fold_generator(cMaxLevel, cMaxRounds) + fxnScore = lambda listInst: dtree.cv_score(fxnGen(listInst, cFold)) + listData = [fxnScore(listInstClean), fxnScore(listInstNoisy)] + listSeries.append({"name": sName, "data": listData}) + sTitle = "Classification Accuracy For Different Boosting Parameters" - return {"chart": {"defaultSeriesType":"column"}, + return {"chart": {"defaultSeriesType": "column"}, "title": {"text": sTitle}, "xAxis": {"categories": ["Clean", "Noisy"]}, "yAxis": 
{"title": {"text": "Fraction Correct"}, - "min":0.0, "max":1.0}, + "min": 0.0, "max": 1.0}, "series": listSeries} + class BcwBoostingTrainVsTest(tftask.ChartTask): + def get_name(self): return "Compare Boosting Training- and Test-Set Accuracy" + def get_description(self): return ("Assess the relationship in boosting between cross-validated " "training- and test-set performance on clean BCW data.") + def build_fold_gen(self, cRounds, fUseTraining): - def yield_folds(listInst,cFold): - for cvf in dtree.yield_boosted_folds(listInst,cFold): + def yield_folds(listInst, cFold): + for cvf in dtree.yield_boosted_folds(listInst, cFold): cvf.cMaxRounds = cRounds if fUseTraining: cvf.listInstTest = cvf.listInstTraining yield cvf return yield_folds + def get_priority(self): return 5 + def task(self): listInst = get_clean_insts() cFold = 10 listSeries = [] - for sNamePref,fUseTraining in [("Training", True), ("Test", False)]: + for sNamePref, fUseTraining in [("Training", True), ("Test", False)]: listData = [] - for cRounds in xrange(1,16): - fxnGen = self.build_fold_gen(cRounds,fUseTraining) - listData.append(dtree.cv_score(fxnGen(listInst,cFold))) + for cRounds in xrange(1, 16): + fxnGen = self.build_fold_gen(cRounds, fUseTraining) + listData.append(dtree.cv_score(fxnGen(listInst, cFold))) listSeries.append({"name": sNamePref + " Set Accuracy", "data": listData}) return {"chart": {"defaultSeriesType": "line"}, "title": {"text": "Training- vs. Test-Set Accuracy"}, - "xAxis": {"min": 0, "max":16, "title": {"text":"Rounds"}}, - "yAxis": {"title": {"text":"Accuracy"}}, + "xAxis": {"min": 0, "max": 16, "title": {"text": "Rounds"}}, + "yAxis": {"title": {"text": "Accuracy"}}, "series": listSeries} - - if __name__ == "__main__": btt = BcwTreeTask() print btt.task() print tftask.list_tasks(BcwTreeTask.__module__) - diff --git a/DecisionTrees &Boosting/events.sqlite b/DecisionTreesAndBoosting/events.sqlite similarity index 100% rename from DecisionTrees &Boosting/events.sqlite rename to DecisionTreesAndBoosting/events.sqlite diff --git a/DecisionTrees &Boosting/hw1.bat b/DecisionTreesAndBoosting/hw1.bat similarity index 100% rename from DecisionTrees &Boosting/hw1.bat rename to DecisionTreesAndBoosting/hw1.bat diff --git a/DecisionTrees &Boosting/hw1.pdf b/DecisionTreesAndBoosting/hw1.pdf similarity index 100% rename from DecisionTrees &Boosting/hw1.pdf rename to DecisionTreesAndBoosting/hw1.pdf diff --git a/DecisionTreesAndBoosting/hw1.txt b/DecisionTreesAndBoosting/hw1.txt new file mode 100644 index 0000000..243e5d2 --- /dev/null +++ b/DecisionTreesAndBoosting/hw1.txt @@ -0,0 +1,226 @@ +CS181 Assignment 1: Decision Trees +Professor David Parkes +Out Monday, January 31st +Due at Noon of Friday, February 11th +February 9, 2011 +General Instructions: +You may work with one other person on this assignment. Each group should turn in one writeup. +To submit, copy your assignment files to nice.fas.harvard.edu and run make submit. This assignment consists of a theoretical component and an experimental component. The experimental +component requires you to write code and analyze the effectiveness of different algorithms you implement. For the experimental results, we have provided a graphical interface that will generate the +requisite charts and figures from your code. +In this assignment, you will develop a classifier for medical data. You will be working with a +database of instances describing patients who have been tested for breast cancer. 
You will develop
+a classifier that can classify growths as malignant or benign, based on the results of tests taken
+by a patient. The dataset was derived from the Wisconsin breast cancer corpus, obtained from the
+UC Irvine machine learning repository at http://archive.ics.uci.edu/ml/. The UC Irvine repository
+is an important collection of many of the most frequently used machine learning benchmarks.
+You can find the dataset for this assignment, as well as code, at
+http://www.seas.harvard.edu/courses/cs181/docs/asst1.tar.gz. The data can be found in data.csv,
+while noisy.dat contains the same data with a certain amount of random “noise” added. Each
+dataset contains a total of 100 samples. Each sample in the data consists of 9 features, each of
+which ranges from 1 to 10, and a boolean classification that is 0 or 1. The file
+breast-cancer-wisconsin.names describes the features and also contains information about the
+history of the dataset.
+1. [15 Points] Decision Trees and ID3
+(a) [5 Points] Suppose that the ID3 algorithm is in the middle of classifying a data set, and
+there are seven instances remaining, with four positive and three negative instances. It
+has the choice of splitting on two binary features A and B. When A is true, there are
+two positive and two negative instances, while when A is false, there are two positive and
+one negative instances. Meanwhile, when B is true there is one positive and one negative
+instance, while when B is false there are three positive and two negative instances.
+Which feature will ID3 choose to split on? Show the information gain calculations. For
+each of the two possible splits, present an informal and brief argument that the split is
+more useful than the other. What does this example show about the inductive bias of
+ID3?
+(b) [5 Points]
+Use your work in part (a) to show a tree that ID3 might construct for the following
+dataset, in which there are four Boolean features and a Boolean classification. You do
+not need to show the information gain computations, but you should briefly justify why a
+particular feature was chosen at each point in the tree. In case a tie needs to be broken,
+indicate which other feature(s) could have been chosen.
+    A  B  C  D  |  Class
+    T  F  T  F  |  F
+    T  F  F  F  |  T
+    F  F  F  T  |  F
+    T  F  F  T  |  T
+    F  T  T  T  |  T
+    T  T  F  T  |  F
+    F  F  F  T  |  T
+(c) [5 Points] By eyeballing the data, find a simpler tree that has the same training error as
+the one produced by ID3. What can we learn from this example about the ID3 algorithm?
+2. [77 Points] ID3 with Pruning
+In this section, we will implement the following machine learning techniques:
+• ID3
+• bottom-up decision tree pruning
+• cross-validation
+• AdaBoost (to be covered in class on Wednesday, February 2nd)
+This will require a substantial amount of code. We’ve provided you with a few resources.
+You should begin by downloading the assignment code here:
+http://www.seas.harvard.edu/courses/cs181/docs/asst1.tar.gz
+You can extract this archive with tar -xvzf asst1.tar.gz on Linux or OS X. On Windows,
+we recommend you use 7-Zip to extract the archive.
+This portion of the assignment will consist of a series of programming exercises. As you
+complete the exercises, you will be able to answer a series of accompanying questions. You
+should include your answers in the written portion of the assignment. These questions have
+been marked with a double arrow like this:
+⇒ Who is Spain?
+To help you with this assignment, we have provided a number of empty python functions for
+you to fill in. You should be able to get an idea of what each function does by looking at its
+docstring, which is a special comment beginning on the line below the function name.
+Furthermore, we have provided you with an extensive test suite to exercise your code. This
+test suite contains a set of unit tests. A unit test is a piece of code designed to exercise an
+atomic piece or “unit” of code and ensure its proper functionality. Unit testing a piece of code
+allows you to find bugs earlier and build on top of existing code with confidence. To
+(optionally) read more about unit testing, check out Wikipedia’s article on the subject.
+You can run the test suite on the command line (python testdtree.py), or through a handy
+web interface that runs in your browser. To use this interface, run ./hw1 on Linux or OS X,
+or hw1.bat on Windows. (Note: the .bat file has not been tested. You should contact the
+course staff if you have trouble starting the graphical interface on Windows.)
+This should bring up the interface:
+Clicking on the name of any test will run it. If the test passes, the square on the right side of
+the test name will turn green. If the test fails, the square will turn red, and a button titled
+“Show Failure” will appear. Clicking this button will reveal a traceback that may contain
+information about the test failure.
+The idea behind testing is that it will allow you to quickly build on your work and reduce the
+amount of time you will spend debugging. In order to help you see how the functions you’re
+implementing fit together, we’ve included a call graph from the solution code. It is contained
+in the file call_graph.png.
+As a warmup, implement the function in dtree.py called compute_entropy. An explanation of
+this function is provided in its docstring. In the web interface, run the test_compute_entropy
+test. Once your function is working and the test passes, open up your web interface, click the
+“Tasks” tab, and then find the task called “Plot Entropy Curve.” It should be the first task
+on the list. Click “Run.” This should generate an entropy curve like that shown in the lecture
+notes. As you progress through this assignment, you will be able to run the rest of the tasks in
+the “Tasks” pane. If you run a task that relies on functionality you have not yet implemented,
+or if your code raises an exception, you will see a stack trace which provides the details of the
+exception.
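+For orientation, here is one possible shape for such an entropy function (an
+editor’s illustration only, not the official solution; it assumes the function
+receives the total weights of the True and False instances, which is how the
+test suite exercises it, and that dtree.py imports the math module):
+
+    def compute_entropy(dblWeightTrue, dblWeightFalse):
+        if dblWeightTrue == 0.0 or dblWeightFalse == 0.0:
+            return 0.0
+        p = dblWeightTrue / (dblWeightTrue + dblWeightFalse)
+        return -(p * math.log(p, 2) + (1.0 - p) * math.log(1.0 - p, 2))
+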
+(a) [20 Points] First up, we’ll be implementing ID3. Open up dtree.py. Complete the
+following functions:
+• separate_by_attribute
+• compute_entropy_of_split
+• choose_split_attribute
+• check_for_common_label
+• majority_label
+• build_tree_rec
+• count_instance_attributes
+• classify
+Most of these functions will be quite short, often less than ten lines. Whenever possible,
+try to use functionality you have already implemented by calling a function you have
+already filled-in and tested. For a hint as to which functions you might find useful in
+implementing function foo, look at the solution code call graph, locate the box for foo,
+and see which functions it calls. Once you’ve implemented a function, run any
+corresponding tests for that function and make sure they pass.
+When you’ve implemented these functions, you will be ready to run another task. Find
+the task called “Build BCW Tree,” and click “Run.” This should produce a visualization
+of a decision tree built from the BCW dataset.
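+As an illustration of how short these functions can be, here is one possible
+shape for the first of them (an editor’s sketch, not the official solution;
+it assumes separate_by_attribute maps each observed value of attribute ixAttr
+to the list of instances taking that value, which is what the unit tests
+check):
+
+    def separate_by_attribute(listInst, ixAttr):
+        dictInst = {}
+        for inst in listInst:
+            dictInst.setdefault(inst.listAttrs[ixAttr], []).append(inst)
+        return dictInst
+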
+
+Note: In order to build decision trees (using the DTree class) you may find it easiest
+to use Python’s keyword argument feature, explained here:
+http://docs.python.org/tutorial/controlflow.html#keyword-arguments
+(b) [10 Points]
+As a prelude to pruning, we need to implement cross-validation. This functionality is
+encompassed by the following functions:
+• weight_correct_incorrect
+• evaluate_classification
+• check_folds
+• yield_cv_folds
+• cv_score
+When you’ve completed these functions, you should be able to run the next two tasks:
+“Measure Cross-Validated ID3 Training Set Accuracy” and “Measure Cross-Validated
+Performance.” The first task will demonstrate the training set accuracy of ID3 without
+pruning. The second task will give you cross-validated test performance on both the
+clean and noisy data sets.
+(c) [15 Points] Now on to validation-set pruning. In order to get this working, you’ll need
+to figure out how to implement cross-validation with a validation set. You’ll need to
+implement the following:
+• prune_tree
+• build_pruned_tree
+• yield_cv_folds_with_validation
+You should be able to re-run the task named “Measure Cross-Validated Performance”
+and see results for pruned decision trees. You can also see the result of validation set
+pruning on a decision tree for the BCW data set when you run “Prune BCW Decision
+Tree.”
+⇒ Does ID3 suffer from overfitting on this data set? Justify your answer.
+(d) [32 Points] Boosting
+The boosting paradigm presents another way of overcoming the over-fitting problem.
+In this problem, you will implement AdaBoost and experiment with various
+boosting possibilities.
+Remember that AdaBoost builds a series of classifiers from the same learner. In each
+round of boosting, AdaBoost changes the weight it places on the various instances in
+its training set. As preparation for this aspect of the algorithm, the ID3 functionality
+you have implemented up to this point has taken instance weight into account. For
+example, splitting decisions (in choose_split_attribute) and cross-validated accuracy
+calculations (cv_score) required you to consider the weights of the instances involved in
+these operations.
+i. ⇒ [4 Points] How does your ID3 implementation make use of instance weight in
+the splitting decisions it makes? Explain why AdaBoost on ID3 would not work
+if splitting decisions in ID3 were made by counting instances rather than summing
+weights.
+ii. ⇒ [4 Points] What is the weighted entropy of a set of examples {x1, ..., xn} where
+target y1 = T but all other targets yi = F, and w1 = 0.5 while all other weights are
+0.5/(n − 1)?
+Now, complete the following functions in order to implement boosting:
+• normalize_weights
+• init_weights
+• classifier_error
+• classifier_weight
+• update_weight_unnormalized
+• one_round_boost
+• boost
+• classify_boosted
+• yield_boosted_folds
+Once you’ve completed these functions, you should be able to run all remaining tasks.
+Using the charts produced by these tasks, answer the following questions in your written
+response:
+i. ⇒ [6 Points] Compare the effectiveness of boosting to the other methods you
+implemented previously. What do the relative performances of pruned decision trees
+and boosting on the noisy data set imply about types of classification problems in
+which boosting is effective?
+ii. ⇒ [6 Points] How does the maximum depth of the weak learner affect cross-validated
+test performance for boosting on both datasets? 
How can we explain these results? +iii. ⇒ [6 Points] If we did not know that boosting produces a maximum-margin classifier, what would we find surprising in comparing the results from 10 and 30 rounds +of boosting? +iv. ⇒ [6 Points] What is the relationship between training- and test-set cross-validation +performance over the first fifteen rounds of boosting? +3. [8 Points] Tree Analysis +Choose a particularly effective decision tree on the BCW data set and examine the structure of +the tree, mapping feature indices to qualitative descriptions using the file breast-cancer-wisconsin.names. +Present the tree you choose along with the methodology used to generate the tree. Which features are most important for benign / malignant determination? + + \ No newline at end of file diff --git a/DecisionTrees &Boosting/noisy.dat b/DecisionTreesAndBoosting/noisy.dat similarity index 100% rename from DecisionTrees &Boosting/noisy.dat rename to DecisionTreesAndBoosting/noisy.dat diff --git a/DecisionTrees &Boosting/submit.sh b/DecisionTreesAndBoosting/submit.sh similarity index 100% rename from DecisionTrees &Boosting/submit.sh rename to DecisionTreesAndBoosting/submit.sh diff --git a/DecisionTrees &Boosting/testdtree.py b/DecisionTreesAndBoosting/testdtree.py similarity index 69% rename from DecisionTrees &Boosting/testdtree.py rename to DecisionTreesAndBoosting/testdtree.py index 4f11ec4..e97c38d 100755 --- a/DecisionTrees &Boosting/testdtree.py +++ b/DecisionTreesAndBoosting/testdtree.py @@ -5,56 +5,64 @@ import math import unittest -import dtree +import dtree + def repeated(fn): @functools.wraps(fn) def wrapper(obj, *args, **kwargs): - cRepeat = getattr(obj,"REPEAT") if hasattr(obj,"REPEAT") else 100 + cRepeat = getattr(obj, "REPEAT") if hasattr(obj, "REPEAT") else 100 for _ in xrange(cRepeat): - fn(obj,*args,**kwargs) + fn(obj, *args, **kwargs) wrapper.wrapped = fn return wrapper + def randbool(dblP=0.5): return random.random() < dblP -def randlist(lo,hi,n): - return map(lambda x: x(lo,hi), [random.randint]*n) -def build_one_instance(cAttrs,cValues,fxnGenWeight,fxnGenLabel): - listAttrs = randlist(0,cValues-1,cAttrs) +def randlist(lo, hi, n): + return map(lambda x: x(lo, hi), [random.randint] * n) + + +def build_one_instance(cAttrs, cValues, fxnGenWeight, fxnGenLabel): + listAttrs = randlist(0, cValues - 1, cAttrs) return dtree.Instance(listAttrs, fxnGenLabel(listAttrs), fxnGenWeight()) -def build_instance_generator(dblLabelDist=0.5,cAttrs=10, cValues=4, + +def build_instance_generator(dblLabelDist=0.5, cAttrs=10, cValues=4, fxnGenWeight=None, fxnGenLabel=None): if fxnGenWeight is None: fxnGenWeight = lambda: 1.0 if fxnGenLabel is None: fxnGenLabel = lambda _: randbool(dblLabelDist) + def build_instances(n=1): - build1 = lambda: build_one_instance(cAttrs,cValues,fxnGenWeight, + build1 = lambda: build_one_instance(cAttrs, cValues, fxnGenWeight, fxnGenLabel) return [build1() for _ in xrange(n)] build_instances.cAttrs = cAttrs build_instances.cValues = cValues return build_instances -def build_entropy_one_instances(cAttr,cValue): - listInstTrue = [dtree.Instance([0 for _ in xrange(cAttr)],True) + +def build_entropy_one_instances(cAttr, cValue): + listInstTrue = [dtree.Instance([0 for _ in xrange(cAttr)], True) for f in xrange(cValue)] - listInstFalse = [dtree.Instance([0 for _ in xrange(cAttr)],False,0.5) - for f in xrange(2*cValue)] + listInstFalse = [dtree.Instance([0 for _ in xrange(cAttr)], False, 0.5) + for f in xrange(2 * cValue)] for ixAttr in xrange(cAttr): for ixValue in xrange(cValue): - 
ixFalse = 2*ixValue + ixFalse = 2 * ixValue listInstTemp = (listInstTrue[ixValue], listInstFalse[ixFalse], - listInstFalse[ixFalse+1]) + listInstFalse[ixFalse + 1]) for inst in listInstTemp: inst.listAttrs[ixAttr] = ixValue return listInstTrue + listInstFalse - + + def force_instance_consistency(listInst): dictMapping = {} for inst in listInst: @@ -64,8 +72,10 @@ def force_instance_consistency(listInst): else: dictMapping[tupleKey] = inst.fLabel + def build_consistent_generator(*args, **kwargs): - fxnGen = build_instance_generator(*args,**kwargs) + fxnGen = build_instance_generator(*args, **kwargs) + @functools.wraps(fxnGen) def wrapper(cInst): listInst = fxnGen(cInst) @@ -73,9 +83,11 @@ def wrapper(cInst): return listInst return wrapper + def build_jagged_instances(): - return [dtree.Instance([0]*random.randint(5,10)) - for _ in xrange(random.randint(25,30))] + return [dtree.Instance([0] * random.randint(5, 10)) + for _ in xrange(random.randint(25, 30))] + class EntropyTest(unittest.TestCase): REPEAT = 100 @@ -83,11 +95,11 @@ class EntropyTest(unittest.TestCase): @repeated def test_compute_entropy(self): - dblK = 1000000.0*random.random() - self.assertAlmostEqual(1.0, dtree.compute_entropy(dblK,dblK)) + dblK = 1000000.0 * random.random() + self.assertAlmostEqual(1.0, dtree.compute_entropy(dblK, dblK)) self.assertAlmostEqual(0.0, dtree.compute_entropy(0.0, dblK)) self.assertAlmostEqual(0.0, dtree.compute_entropy(dblK, 0.0)) - + @repeated def test_separate_by_attribute(self): fxnGen = build_instance_generator(0.5) @@ -96,20 +108,20 @@ def test_separate_by_attribute(self): dictInst = dtree.separate_by_attribute(listInst, ixAttr) setValues = set([inst.listAttrs[ixAttr] for inst in listInst]) self.assertEqual(len(setValues), len(dictInst)) - for cValue,listInstSeparate in dictInst.iteritems(): + for cValue, listInstSeparate in dictInst.iteritems(): for inst in listInstSeparate: self.assertEqual(cValue, inst.listAttrs[ixAttr]) - + @repeated def test_compute_entropy_of_split(self): - cAttrs = random.randint(2,20) - cValues = random.randint(1,30) + cAttrs = random.randint(2, 20) + cValues = random.randint(1, 30) fxnGenOne = lambda _: build_entropy_one_instances(cAttrs, cValues) fxnGenOne.cAttrs = cAttrs fxnGenOne.cValues = cValues fxnGenZero = build_instance_generator(0.0, cAttrs=3) dblDelta = 0.01 - for fxnGen,dblP in zip((fxnGenOne,fxnGenZero,),(1.0,0.0)): + for fxnGen, dblP in zip((fxnGenOne, fxnGenZero, ), (1.0, 0.0)): listInst = fxnGen(self.cInsts) for ixAttr in xrange(fxnGen.cAttrs): dictInst = dtree.separate_by_attribute(listInst, ixAttr) @@ -118,47 +130,49 @@ def test_compute_entropy_of_split(self): "%.3f not within %.3f of expected %.3f" % (dblEntropy, dblDelta, dblP)) - def test_compute_entropy_of_split_weighted(self): + def test_compute_entropy_of_split_weighted(self): fxnGenTrue = build_instance_generator(1.0) fxnGenFalse = build_instance_generator(0.0, fxnGenWeight=lambda: 0.25) cInst = 10 - listInst = fxnGenTrue(cInst) + fxnGenFalse(4*cInst) + listInst = fxnGenTrue(cInst) + fxnGenFalse(4 * cInst) dblEntropy = dtree.compute_entropy_of_split({0: listInst}) self.assertAlmostEqual(1.0, dblEntropy) @repeated def test_choose_split_attribute(self): cAttrs = 4 - ixBest = random.randint(0,cAttrs-1) + ixBest = random.randint(0, cAttrs - 1) + def generate_label(listAttrs): return bool(listAttrs[ixBest] % 2) fxnGen = build_instance_generator(cAttrs=cAttrs, fxnGenLabel=generate_label) listInst = fxnGen(self.cInsts) - ixChosen,dictBest = dtree.choose_split_attribute(range(cAttrs), - listInst, 
0.0) - self.assertEqual(ixBest,ixChosen) + ixChosen, dictBest = dtree.choose_split_attribute(range(cAttrs), + listInst, 0.0) + self.assertEqual(ixBest, ixChosen) # should come up w/something stronger - self.assertEqual(type(dictBest),dict) + self.assertEqual(type(dictBest), dict) @repeated def test_check_for_common_label(self): fxnGenTrue = build_instance_generator(1.0) fxnGenFalse = build_instance_generator(0.0) fxnGenNone = build_instance_generator() - listPair = ((fxnGenTrue,True),(fxnGenFalse,False),(fxnGenNone,None),) - for fxnGen,expected in listPair: + listPair = ((fxnGenTrue, True), (fxnGenFalse, False), + (fxnGenNone, None), ) + for fxnGen, expected in listPair: listInst = fxnGen(self.cInsts) fLabel = dtree.check_for_common_label(listInst) self.assertTrue(fLabel is expected, "%s is not %s" - % (fLabel,expected)) + % (fLabel, expected)) @repeated def test_majority_label(self): fxnGenTrue = build_instance_generator(1.0) fxnGenFalse = build_instance_generator(0.0) - cLenTrue = random.randint(5,10) - cLenFalse = random.randint(5,10) + cLenTrue = random.randint(5, 10) + cLenFalse = random.randint(5, 10) if cLenTrue == cLenFalse: cLenTrue += 1 listInst = fxnGenTrue(cLenTrue) + fxnGenFalse(cLenFalse) @@ -168,64 +182,68 @@ def test_majority_label(self): @repeated def test_majority_label_weighted(self): dblScale = 25.0 + def gen_insts_for_label(fLabel): dblW = random.random() * dblScale listInst = [] dblInstWeight = 0.0 while dblInstWeight < dblW: dblNextWeight = random.random() - listInst.append(dtree.Instance([],fLabel,dblNextWeight)) + listInst.append(dtree.Instance([], fLabel, dblNextWeight)) dblInstWeight += dblNextWeight - return listInst,dblInstWeight - listInstT,dblT = gen_insts_for_label(True) - listInstF,dblF = gen_insts_for_label(False) + return listInst, dblInstWeight + listInstT, dblT = gen_insts_for_label(True) + listInstF, dblF = gen_insts_for_label(False) listInstAll = listInstT + listInstF random.shuffle(listInstAll) fMajorityLabel = dtree.majority_label(listInstAll) - self.assertEqual(dblT > dblF, fMajorityLabel) + self.assertEqual(dblT > dblF, fMajorityLabel) + def check_dt_members(dt): if dt.is_leaf() and dt.is_node(): return False, ("Tree is not clearly a leaf or node. 
Only one" " of fLabel and ixAttr should be not None.") - for cValue,dtChild in dt.dictChildren.iteritems(): - fSuccess,sMsg = check_dt_members(dtChild) + for cValue, dtChild in dt.dictChildren.iteritems(): + fSuccess, sMsg = check_dt_members(dtChild) if not fSuccess: - return fSuccess,sMsg - return True,None + return fSuccess, sMsg + return True, None + class ConstructionTest(unittest.TestCase): - def check_dt(self,dtRoot,cMaxLevel): - def down(dt,cLvl): + + def check_dt(self, dtRoot, cMaxLevel): + def down(dt, cLvl): self.assertTrue(cLvl <= cMaxLevel) if dt.is_node(): for dtChild in dt.dictChildren.values(): - down(dtChild,cLvl+1) - down(dtRoot,0) + down(dtChild, cLvl + 1) + down(dtRoot, 0) - def assert_dt_members(self,dt): - fSuccess,sMsg = check_dt_members(dt) + def assert_dt_members(self, dt): + fSuccess, sMsg = check_dt_members(dt) self.assertTrue(fSuccess, sMsg) @repeated def test_build_tree_rec_leaf(self): fLabel = randbool() - listInst = [dtree.Instance([],fLabel)]*random.randint(1,3) - dt = dtree.build_tree_rec([],listInst,0.0,-1) + listInst = [dtree.Instance([], fLabel)] * random.randint(1, 3) + dt = dtree.build_tree_rec([], listInst, 0.0, -1) self.assert_dt_members(dt) self.assertTrue(dt.is_leaf(), "dt was not a leaf") self.assertEqual(dt.fLabel, fLabel) @repeated def test_build_tree_rec_stump(self): - pairBounds = (5,10) + pairBounds = (5, 10) build_list_inst_bool = (lambda f: - [dtree.Instance([int(f),randbool()],fLabel=f) - for _ in xrange(random.randint(*pairBounds))]) + [dtree.Instance([int(f), randbool()], fLabel=f) + for _ in xrange(random.randint(*pairBounds))]) listInst = build_list_inst_bool(True) + build_list_inst_bool(False) setIxAttr = set(range(2)) cPrevSetIxAttrLen = len(setIxAttr) - dt = dtree.build_tree_rec(setIxAttr, listInst, 0.0,-1) + dt = dtree.build_tree_rec(setIxAttr, listInst, 0.0, -1) self.assert_dt_members(dt) self.assertEqual(cPrevSetIxAttrLen, len(setIxAttr), "setIxAttr changed size in build_tree_rec") @@ -233,7 +251,7 @@ def test_build_tree_rec_stump(self): self.assertEqual(dt.ixAttr, 0) dt0 = dt.dictChildren[0] dt1 = dt.dictChildren[1] - for dtChild,fExpected in ((dt0,False), (dt1,True)): + for dtChild, fExpected in ((dt0, False), (dt1, True)): self.assertTrue(dtChild.is_leaf(), "dtChild was not a leaf") self.assertEqual(dtChild.fLabel, fExpected) @@ -241,29 +259,29 @@ def test_build_tree_rec_stump(self): def test_build_tree_depth_limit(self): fxnGen = build_consistent_generator(10) listInst = fxnGen(100) - cMaxLevel = random.randint(0,3) + cMaxLevel = random.randint(0, 3) dt = dtree.build_tree(listInst, cMaxLevel=cMaxLevel) self.assert_dt_members(dt) - self.check_dt(dt,cMaxLevel) + self.check_dt(dt, cMaxLevel) @repeated def test_build_tree_gain_limit(self): listInst = [] - cAttr = random.randint(5,10) - ixAttrImportant = random.randint(0,cAttr-1) - for _ in xrange(random.randint(25,150)): - listAttr = randlist(0,1,cAttr) + cAttr = random.randint(5, 10) + ixAttrImportant = random.randint(0, cAttr - 1) + for _ in xrange(random.randint(25, 150)): + listAttr = randlist(0, 1, cAttr) fLabel = bool(listAttr[ixAttrImportant]) - listInst.append(dtree.Instance(listAttr,fLabel)) + listInst.append(dtree.Instance(listAttr, fLabel)) dt = dtree.build_tree(listInst, dblMinGain=0.55) self.assert_dt_members(dt) self.assertTrue(dt.is_node()) - self.check_dt(dt,1) + self.check_dt(dt, 1) @repeated def test_count_instance_attributes(self): - cLen = random.randint(3,10) - listInst = [dtree.Instance([0]*cLen)]*random.randint(5,10) + cLen = random.randint(3, 10) + listInst 
= [dtree.Instance([0] * cLen)] * random.randint(5, 10) cLenObserved = dtree.count_instance_attributes(listInst) self.assertEqual(cLen, cLenObserved) listInstJag = build_jagged_instances() @@ -272,22 +290,23 @@ def test_count_instance_attributes(self): def test_build_tree_raises(self): self.assertRaises(TypeError, dtree.build_tree, build_jagged_instances()) + @repeated def test_build_tree(self): # test case size grows exponentially in this - cAttrs = random.randint(1,5) + cAttrs = random.randint(1, 5) listInst = [] for ixAttr in xrange(cAttrs): - cEach = 2**(cAttrs - ixAttr) - listAttrPrefixLeft = [1]*ixAttr + cEach = 2 ** (cAttrs - ixAttr) + listAttrPrefixLeft = [1] * ixAttr for _ in xrange(cEach): - listAttrSuffix = [0]*(cAttrs - ixAttr) + listAttrSuffix = [0] * (cAttrs - ixAttr) listAttr = listAttrPrefixLeft + listAttrSuffix fLabel = bool(ixAttr % 2) - inst = dtree.Instance(listAttr,fLabel) + inst = dtree.Instance(listAttr, fLabel) listInst.append(inst) dt = dtree.build_tree(listInst) - for ixAttr in xrange(cAttrs-1): + for ixAttr in xrange(cAttrs - 1): self.assertEqual(dt.ixAttr, ixAttr) dtLeft = dt.dictChildren[0] self.assertTrue(dtLeft.is_leaf()) @@ -298,59 +317,65 @@ def test_build_tree(self): @repeated def test_build_tree_no_gain(self): - listAttr = randlist(0,5,10) - listInst = [dtree.Instance(listAttr, randbool())]*random.randint(25,30) + listAttr = randlist(0, 5, 10) + listInst = [dtree.Instance(listAttr, randbool())] * \ + random.randint(25, 30) dt = dtree.build_tree(listInst) fMajorityLabel = dtree.majority_label(listInst) self.assertTrue(dt.is_leaf()) - self.assertEquals(dt.fLabel, fMajorityLabel) + self.assertEquals(dt.fLabel, fMajorityLabel) + -def build_random_tree(cAttr,cValue): +def build_random_tree(cAttr, cValue): def down(listIxAttr): if listIxAttr: ixAttr = random.choice(listIxAttr) listIxAttrNext = list(listIxAttr) listIxAttrNext.remove(ixAttr) - dt = dtree.DTree(ixAttr=ixAttr,fDefaultLabel=randbool()) + dt = dtree.DTree(ixAttr=ixAttr, fDefaultLabel=randbool()) for cV in xrange(cValue): dt.add(down(listIxAttrNext), cV) return dt return dtree.DTree(fLabel=randbool()) return down(range(cAttr)) -def build_random_instance_from_dt(dt,cAttr=None): + +def build_random_instance_from_dt(dt, cAttr=None): listPath = [] while dt.is_node(): - cV,dtChild = random.choice(dt.dictChildren.items()) - listPath.append((dt.ixAttr,cV)) + cV, dtChild = random.choice(dt.dictChildren.items()) + listPath.append((dt.ixAttr, cV)) dt = dtChild assert dt.is_leaf() listAttr = [] - cMaxAttr = max([ixAttr for ixAttr,_ in listPath]) + cMaxAttr = max([ixAttr for ixAttr, _ in listPath]) dictPath = dict(listPath) if cAttr is None: - cAttr = cMaxAttr + random.randint(1,5) + cAttr = cMaxAttr + random.randint(1, 5) for ixAttr in xrange(cAttr): - cV = dictPath[ixAttr] if ixAttr in dictPath else random.randint(0,10) + cV = dictPath[ixAttr] if ixAttr in dictPath else random.randint(0, 10) listAttr.append(cV) - return dtree.Instance(listAttr, dt.fLabel),listPath - + return dtree.Instance(listAttr, dt.fLabel), listPath + + class PredictionTest(unittest.TestCase): + @repeated def test_classify(self): - dt = build_random_tree(4,3) + dt = build_random_tree(4, 3) for _ in xrange(5): - inst,listPath = build_random_instance_from_dt(dt) - fLabel = dtree.classify(dt,inst) + inst, listPath = build_random_instance_from_dt(dt) + fLabel = dtree.classify(dt, inst) self.assertEqual(inst.fLabel, fLabel) @repeated def test_classify_unknown(self): cValue = 3 - dt = build_random_tree(4,cValue) - inst = 
dtree.Instance(randlist(cValue+1, cValue+5, 4)) - fLabel = dtree.classify(dt,inst) - self.assertEqual(fLabel, dt.fDefaultLabel) + dt = build_random_tree(4, cValue) + inst = dtree.Instance(randlist(cValue + 1, cValue + 5, 4)) + fLabel = dtree.classify(dt, inst) + self.assertEqual(fLabel, dt.fDefaultLabel) + def check_instance_membership(listInstDb, listInstQueries): def make_key(inst): @@ -362,25 +387,27 @@ def make_key(inst): return False return True + class EvaluationTest(unittest.TestCase): REPEAT = 25 - + @repeated def test_evaluate_classification(self): def increase_values(inst): - listIncreased = [c+cValues+1 for c in inst.listAttrs] + listIncreased = [c + cValues + 1 for c in inst.listAttrs] return dtree.Instance(listIncreased, not fMajorityLabel) + def filter_unclassifiable(listInst): dt = dtree.build_tree(listInst) return [inst for inst in listInst - if dtree.classify(dt,inst) == inst.fLabel] + if dtree.classify(dt, inst) == inst.fLabel] cValues = 2 fxnGen = build_instance_generator(cValues=cValues) listInst = fxnGen(15) force_instance_consistency(listInst) listInst = filter_unclassifiable(listInst) fMajorityLabel = dtree.majority_label(listInst) - listInstImpossible = map(increase_values,listInst) + listInstImpossible = map(increase_values, listInst) listInstTest = listInst + listInstImpossible cvf = dtree.TreeFold(listInst, listInstTest) rslt = dtree.evaluate_classification(cvf) @@ -390,7 +417,7 @@ def filter_unclassifiable(listInst): listInst, rslt.listInstCorrect), "Missing correct instances") self.assertTrue(check_instance_membership( listInstImpossible, rslt.listInstIncorrect), - "Missing incorrect instances") + "Missing incorrect instances") @repeated def test_weight_corrrect_incorrect(self): @@ -399,30 +426,33 @@ def make_list(cLen): dblSum = 0.0 for _ in xrange(cLen): dbl = math.exp(-random.random() - 0.1) * 10.0 - listI.append(dtree.Instance([],randbool(),dbl)) + listI.append(dtree.Instance([], randbool(), dbl)) dblSum += dbl - return listI,dblSum - listInstCorrect,dblCorrect = make_list(random.randint(0,10)) - listInstIncorrect,dblIncorrect = make_list(random.randint(0,10)) - rslt = dtree.EvaluationResult(listInstCorrect, listInstIncorrect,None) - dblC,dblI = dtree.weight_correct_incorrect(rslt) - self.assertAlmostEqual(dblCorrect,dblC) - self.assertAlmostEqual(dblIncorrect,dblI) - -def build_foldable_instances(lo=3,hi=10): - cFold = random.randint(lo,hi) - cInsts = random.randint(1,10)*cFold - return [dtree.Instance([i],randbool()) for i in range(cInsts)],cFold + return listI, dblSum + listInstCorrect, dblCorrect = make_list(random.randint(0, 10)) + listInstIncorrect, dblIncorrect = make_list(random.randint(0, 10)) + rslt = dtree.EvaluationResult(listInstCorrect, listInstIncorrect, None) + dblC, dblI = dtree.weight_correct_incorrect(rslt) + self.assertAlmostEqual(dblCorrect, dblC) + self.assertAlmostEqual(dblIncorrect, dblI) + + +def build_foldable_instances(lo=3, hi=10): + cFold = random.randint(lo, hi) + cInsts = random.randint(1, 10) * cFold + return [dtree.Instance([i], randbool()) for i in range(cInsts)], cFold + def build_folded_set(listInst): return set([inst.listAttrs[0] for inst in listInst]) + def is_valid_cvf_builder(obj, fxnBuildCvf, fxnCheckEach, fUseValidation): - listInst,cFold = build_foldable_instances() - cFoldSize = len(listInst)/cFold + listInst, cFold = build_foldable_instances() + cFoldSize = len(listInst) / cFold setI = build_folded_set(listInst) cFoldsYielded = 0 - for cvf in fxnBuildCvf(list(listInst),cFold): + for cvf in 
fxnBuildCvf(list(listInst), cFold): if not fxnCheckEach(cvf): return False setTrain = build_folded_set(cvf.listInstTraining) @@ -435,21 +465,22 @@ def is_valid_cvf_builder(obj, fxnBuildCvf, fxnCheckEach, fUseValidation): cFoldsInTraining = cFold - 2 else: cFoldsInTraining = cFold - 1 - obj.assertEqual(cFoldSize*cFoldsInTraining, len(setTrain)) + obj.assertEqual(cFoldSize * cFoldsInTraining, len(setTrain)) obj.assertEqual(setI - setTrain - setValidation, setTest) obj.assertEqual(setI - setTest - setValidation, setTrain) obj.assertEqual(setI - setTrain - setTest, setValidation) cFoldsYielded += 1 return cFold == cFoldsYielded + class CrossValidationTest(unittest.TestCase): REPEAT = 15 - + @repeated def test_yield_cv_folds(self): fxnCheck = lambda cvf: isinstance(cvf, dtree.TreeFold) - is_valid_cvf_builder(self, dtree.yield_cv_folds, fxnCheck,False) - + is_valid_cvf_builder(self, dtree.yield_cv_folds, fxnCheck, False) + @repeated def test_cv_score(self): def label_weight(listInst, fLabel): @@ -461,20 +492,20 @@ def label_weight(listInst, fLabel): cValues = 4 fxnGen = build_consistent_generator(cValues=cValues, fxnGenWeight=random.random) - cInst = random.randint(30,60) + cInst = random.randint(30, 60) listLeft = fxnGen(cInst) - listRight = [dtree.Instance([cAttr+cValues+1 + listRight = [dtree.Instance([cAttr + cValues + 1 for cAttr in inst.listAttrs], - inst.fLabel) for inst in fxnGen(cInst)] + inst.fLabel) for inst in fxnGen(cInst)] fMajL = dtree.majority_label(listLeft) fMajR = dtree.majority_label(listRight) - iterableFolds = [dtree.TreeFold(listLeft,listRight), - dtree.TreeFold(listRight,listLeft)] + iterableFolds = [dtree.TreeFold(listLeft, listRight), + dtree.TreeFold(listRight, listLeft)] dblScore = dtree.cv_score(iterableFolds) dblL = label_weight(listRight, fMajL) dblR = label_weight(listLeft, fMajR) dblTotalWeight = sum([inst.dblWeight for inst in listRight + listLeft]) - self.assertAlmostEqual((dblL + dblR)/dblTotalWeight, dblScore) + self.assertAlmostEqual((dblL + dblR) / dblTotalWeight, dblScore) @repeated def test_yield_cv_folds_with_validation(self): @@ -482,9 +513,10 @@ def test_yield_cv_folds_with_validation(self): is_valid_cvf_builder(self, dtree.yield_cv_folds_with_validation, fxnCheck, True) + class PruneTest(unittest.TestCase): REPEAT = 10 - + @repeated def test_prune_tree(self): """ @@ -501,14 +533,16 @@ def test_prune_tree(self): - prune the tree - repeat for the node's parent, continuing up to the root. 
""" - def set_labels(dtRoot,f): + + def set_labels(dtRoot, f): def down(dt): if dt.is_leaf(): dt.fLabel = f dt.fDefaultLabel = f - map(down,dt.dictChildren.values()) + map(down, dt.dictChildren.values()) down(dtRoot) - def check_passes(dtRoot,dtCheck,inst): + + def check_passes(dtRoot, dtCheck, inst): def down(dt): assert not dt.is_leaf() assert len(dt.dictChildren) == cValue @@ -517,68 +551,71 @@ def down(dt): return down(dtRoot) - cAttr = random.randint(2,4) - cValue = random.randint(2,4) - dtBase = build_random_tree(cAttr,cValue) + cAttr = random.randint(2, 4) + cValue = random.randint(2, 4) + dtBase = build_random_tree(cAttr, cValue) listPath = [] listAttrs = [] - listDt = [] - fTargetValue = True#randbool() + + fTargetValue = True # randbool() set_labels(dtBase, not fTargetValue) - dt = dtBase + dt = dtBase while not dt.is_leaf(): ixValue = random.choice(dt.dictChildren.keys()) listPath.append(ixValue) listAttrs.append(dt.ixAttr) - #print ixValue + # print ixValue dt = dt.dictChildren[ixValue] - #print "-----------------------" + # print "-----------------------" while listPath: listPath.pop() dt = dtRoot = dtBase for ixValue in listPath: - #print ixValue + # print ixValue dt = dt.dictChildren[ixValue] assert dt.is_node() - #print "-----------------------------------" + # print "-----------------------------------" dt.fDefaultLabel = fTargetValue listInst = [] - fxnEnd = lambda: randlist(0,cValue-1,cAttr - len(listPath)) - for _ in xrange(random.randint(1,10)): + fxnEnd = lambda: randlist(0, cValue - 1, cAttr - len(listPath)) + for _ in xrange(random.randint(1, 10)): listValue = listPath + fxnEnd() listInstAttr = [None for _ in xrange(cAttr)] assert len(listValue) == cAttr - for ixValue,ixAttr in zip(listValue,listAttrs): + for ixValue, ixAttr in zip(listValue, listAttrs): listInstAttr[ixAttr] = ixValue inst = dtree.Instance(listInstAttr, fTargetValue) - check_passes(dtRoot,dt,inst) + check_passes(dtRoot, dt, inst) listInst.append(inst) - dtree.prune_tree(dtRoot,listInst) + dtree.prune_tree(dtRoot, listInst) dt = dtRoot - for ix,ixValue in enumerate(listPath): + for ix, ixValue in enumerate(listPath): assert dt.ixAttr == listAttrs[ix] self.assertTrue(dt.is_node(), str(dtRoot)) self.assertTrue(ixValue in dt.dictChildren) dt = dt.dictChildren[ixValue] self.assertTrue(dt.is_leaf(), str(dt)) - + + def is_stump(dt): - for cV,dtChild in dt.dictChildren.iteritems(): + for cV, dtChild in dt.dictChildren.iteritems(): if not dtChild.is_leaf(): return False return True -fxnRandomWeight = lambda: random.random()*1000.0 + 0.1 +fxnRandomWeight = lambda: random.random() * 1000.0 + 0.1 build_random_weight = build_instance_generator(fxnGenWeight=fxnRandomWeight) + class BoostTest(unittest.TestCase): REPEAT = 10 - + @repeated def test_normalize_weights(self): cInst = 100 listInst = build_random_weight(cInst) + def weight_sum(): return sum([inst.dblWeight for inst in listInst], 0.0) self.assertTrue(weight_sum() > 1.0) @@ -591,31 +628,31 @@ def test_init_weights(self): listInst = build_random_weight(cInst) dtree.init_weights(listInst) for inst in listInst: - self.assertAlmostEqual(1.0/float(cInst), inst.dblWeight) + self.assertAlmostEqual(1.0 / float(cInst), inst.dblWeight) @repeated def test_classifier_error(self): cInst = 100 listInst = build_instance_generator()(cInst) - ix = random.randint(0,cInst) + ix = random.randint(0, cInst) rslt = dtree.EvaluationResult(listInst[:ix], listInst[ix:], None) - self.assertAlmostEqual(float(cInst-ix)/float(cInst), + self.assertAlmostEqual(float(cInst - ix) / 
float(cInst), dtree.classifier_error(rslt)) - + @repeated def test_classifier_weight(self): dblError = random.random() dblWeight = dtree.classifier_weight(dblError) - dblFrac = math.exp(2.0*dblWeight) - self.assertAlmostEqual(dblError, 1.0/(dblFrac + 1.0)) + dblFrac = math.exp(2.0 * dblWeight) + self.assertAlmostEqual(dblError, 1.0 / (dblFrac + 1.0)) @repeated def test_update_weight_unnormalized(self): - dblWeight = random.normalvariate(0.0,1.0) - dblClassifierWeight = random.normalvariate(0.0,10.0) + dblWeight = random.normalvariate(0.0, 1.0) + dblClassifierWeight = random.normalvariate(0.0, 10.0) fLabel = randbool() fClassifiedLabel = randbool() - inst = dtree.Instance([],fLabel=fLabel,dblWeight=dblWeight) + inst = dtree.Instance([], fLabel=fLabel, dblWeight=dblWeight) dtree.update_weight_unnormalized(inst, dblClassifierWeight, fClassifiedLabel) dblWeightNew = inst.dblWeight @@ -634,22 +671,22 @@ def test_one_round_boost(self): listInst = fxnGen(cInst) for inst in listInst: inst.listAttrs[0] = int(inst.fLabel) - listInstIncorrect = random.sample(listInst,cInst/10) + listInstIncorrect = random.sample(listInst, cInst / 10) for inst in listInstIncorrect: inst.fLabel = not inst.listAttrs[0] inst.dblWeight = 0.1 - dt,dblError,dblCferWeight = dtree.one_round_boost(listInst,1) + dt, dblError, dblCferWeight = dtree.one_round_boost(listInst, 1) self.assertTrue(is_stump(dt)) - self.assertAlmostEqual(1.0/91.0, dblError) + self.assertAlmostEqual(1.0 / 91.0, dblError) self.assertAlmostEqual(dtree.classifier_weight(dblError), dblCferWeight) self.assertAlmostEqual(1.0, sum([inst.dblWeight for inst in listInst])) @repeated def test_boost(self): - listAttr = randlist(0,5,10) + listAttr = randlist(0, 5, 10) listInst = [dtree.Instance(listAttr, True) for _ in xrange(100)] - listInstFalse = random.sample(listInst,10) + listInstFalse = random.sample(listInst, 10) for inst in listInstFalse: inst.fLabel = False listInstCopy = [inst.copy() for inst in listInst] @@ -659,7 +696,7 @@ def test_boost(self): @repeated def test_boost_maxrounds(self): - cRound = random.randint(2,25) + cRound = random.randint(2, 25) listInst = build_consistent_generator()(100) br = dtree.boost(listInst, cMaxRounds=cRound) self.assertTrue(len(br.listCfer) <= cRound) @@ -668,23 +705,24 @@ def test_boost_maxrounds(self): @repeated def test_classify_boosted(self): def build_stump(fPolarity): - dt = dtree.DTree(ixAttr=0,fDefaultLabel=True) - dt.add(dtree.DTree(fLabel=fPolarity),0) - dt.add(dtree.DTree(fLabel=not fPolarity),1) + dt = dtree.DTree(ixAttr=0, fDefaultLabel=True) + dt.add(dtree.DTree(fLabel=fPolarity), 0) + dt.add(dtree.DTree(fLabel=not fPolarity), 1) return dt - cCfer = 10 - listCfer = [build_stump(bool(i%2)) for i in xrange(cCfer)] + cCfer = 10 + listCfer = [build_stump(bool(i % 2)) for i in xrange(cCfer)] listWeight = [math.exp(-i) for i in xrange(cCfer)] inst = dtree.Instance([int(randbool())], randbool()) - fLabel = dtree.classify_boosted(dtree.BoostResult(listWeight,listCfer), - inst) + fLabel = dtree.classify_boosted( + dtree.BoostResult(listWeight, listCfer), + inst) self.assertEqual(bool(inst.listAttrs[0]), fLabel) @repeated def test_yield_boosted_folds(self): - fxnCheck = lambda cvf: isinstance(cvf,dtree.BoostedFold) + fxnCheck = lambda cvf: isinstance(cvf, dtree.BoostedFold) is_valid_cvf_builder(self, dtree.yield_boosted_folds, fxnCheck, False) - + if __name__ == "__main__": import sys sys.exit(unittest.main()) diff --git a/DecisionTrees &Boosting/view.txt b/DecisionTreesAndBoosting/view.txt similarity index 100% rename 
from DecisionTrees &Boosting/view.txt rename to DecisionTreesAndBoosting/view.txt diff --git a/HMM/.hmm.py.swp b/HMM/.hmm.py.swp deleted file mode 100644 index a2a0848..0000000 Binary files a/HMM/.hmm.py.swp and /dev/null differ diff --git a/HMM/classify.py b/HMM/classify.py index ad10007..cc1946b 100755 --- a/HMM/classify.py +++ b/HMM/classify.py @@ -5,17 +5,19 @@ from __future__ import division from optparse import OptionParser import sys -import os -from util import * +from util import ( + normalize_filename, + print_timing, + ) from dataset import DataSet -from hmm import * - -import sys +from hmm import ( + HMM + ) def split_into_categories(d): - """given a dataset d, return a dict mapping categories + """given a dataset d, return a dict mapping categories to arrays of observation sequences. Only splits the training data""" a = {} for seqnum in range(len(d.train_output)): @@ -34,13 +36,14 @@ def train_N_state_hmms_from_data(filename, num_states, debug=False): builds a separate hmm for each category in data """ dataset = DataSet(filename) category_seqs = split_into_categories(dataset) - + # Build a hmm for each category in data hmms = {} for cat, seqs in category_seqs.items(): if debug: - print "\n\nLearning %s-state HMM for category %s" % (num_states, cat) - + print "\n\nLearning %s-state HMM for category %s" % ( + num_states, cat) + model = HMM(range(num_states), dataset.outputs) model.learn_from_observations(seqs, debug) hmms[cat] = model @@ -50,8 +53,6 @@ def train_N_state_hmms_from_data(filename, num_states, debug=False): return (hmms, dataset) - - @print_timing def compute_classification_performance(hmms, dataset, debug=False): if debug: @@ -65,20 +66,21 @@ def compute_classification_performance(hmms, dataset, debug=False): log_probs = [(cat, hmms[cat].log_prob_of_sequence(seq)) for cat in hmms.keys()] # Want biggest first... - log_probs.sort(lambda a,b: cmp(b[1], a[1])) + log_probs.sort(lambda a, b: cmp(b[1], a[1])) if debug: - ll_str = " ".join(["%s=%.4f" % (c, v) for c,v in log_probs]) - #print "Actual: %s; [%s]" % (actual_category, ll_str) + ll_str = " ".join(["%s=%.4f" % (c, v) for c, v in log_probs]) + print "Actual: %s; [%s]" % (actual_category, ll_str) # Sorted, so the first one is the one we predicted.
best_cat = log_probs[0][0] if actual_category != best_cat: errors += 1 fraction_incorrect = errors * 1.0 / total - #if debug: - print "Classification mistakes: %d / %d = %.3f" % (errors, total, fraction_incorrect) + # if debug: + print "Classification mistakes: %d / %d = %.3f" % ( + errors, total, fraction_incorrect) return fraction_incorrect - + def main(argv=None): if argv is None: @@ -95,18 +97,20 @@ def main(argv=None): print "ERROR: Missing arguments" parser.print_usage() sys.exit(1) - + num_states = int(args[0]) filename = args[1] filename = normalize_filename(filename) # Read all the data, then split it up into each category # Build models from the category data files - hmms, dataset = train_N_state_hmms_from_data(filename, num_states, options.verbose) - + hmms, dataset = train_N_state_hmms_from_data( + filename, num_states, options.verbose) + # See how well we do in classifying test sequences - fraction_incorrect = compute_classification_performance(hmms, dataset, options.verbose) - + fraction_incorrect = compute_classification_performance( + hmms, dataset, options.verbose) + print(fraction_incorrect) return 0 if __name__ == "__main__": diff --git a/HMM/classify.pyc b/HMM/classify.pyc deleted file mode 100644 index 18279f4..0000000 Binary files a/HMM/classify.pyc and /dev/null differ diff --git a/HMM/dataset.py b/HMM/dataset.py index 82d8aac..c8ec3bc 100644 --- a/HMM/dataset.py +++ b/HMM/dataset.py @@ -10,8 +10,10 @@ def list_index(xs): m[x] = i return m + class DataSet: - """ + + """ This class provides the following fields: d.states an array containing the names of all of the states @@ -42,26 +44,25 @@ class DataSet: information on the required format of this file. /""" - def __init__(self, filename, debug=False): self.debug = debug - file = open(filename,"r") + file = open(filename, "r") states = set([]) outputs = set([]) - + # A sequence is a list of (state, output) tuples sequences = [] seq = [] switched = False - for line in file.readlines(): + for line in file.readlines(): line = line.strip() if len(line) == 0: continue - if line == "." or line == "..": + if line == "." or line == "..": # end of sequence sequences.append(seq) seq = [] @@ -74,15 +75,15 @@ def __init__(self, filename, debug=False): sequences = [] else: - words = line.split(); - + words = line.split() + state = words[0] # Keep track of all the states/outputs states.add(state) for output in words[1:]: outputs.add(output) - seq.append( (state, output) ) + seq.append((state, output)) # By the time we get here, better have seen the train/test # divider @@ -92,7 +93,7 @@ def __init__(self, filename, debug=False): # Don't forget to add the last sequence! 
if len(seq) > 0: sequences.append(seq) - + # Ok, the sequences we have now are the test ones test_sequences = sequences @@ -104,15 +105,19 @@ def __init__(self, filename, debug=False): state_map = list_index(self.states) output_map = list_index(self.outputs) - self.train_state = map((lambda seq: map(lambda p: state_map[p[0]], seq)), - train_sequences) - self.train_output = map((lambda seq: map (lambda p: output_map[p[1]], seq)), - train_sequences) + self.train_state = map( + (lambda seq: map(lambda p: state_map[p[0]], seq)), + train_sequences) + self.train_output = map( + (lambda seq: map(lambda p: output_map[p[1]], seq)), + train_sequences) - self.test_state = map((lambda seq: map (lambda p: state_map[p[0]], seq)), - test_sequences) - self.test_output = map((lambda seq: map (lambda p: output_map[p[1]], seq)), - test_sequences) + self.test_state = map( + (lambda seq: map(lambda p: state_map[p[0]], seq)), + test_sequences) + self.test_output = map( + (lambda seq: map(lambda p: output_map[p[1]], seq)), + test_sequences) if self.debug: print self @@ -138,15 +143,14 @@ def __repr__(self): %s """ % (self.states, - self.outputs, - self.train_state, - self.train_output, - self.test_state, - self.test_output) - + self.outputs, + self.train_state, + self.train_output, + self.test_state, + self.test_output) + if __name__ == "__main__": from sys import argv if len(argv) > 1: d = DataSet(argv[1], True) - diff --git a/HMM/dataset.pyc b/HMM/dataset.pyc deleted file mode 100644 index ddde88f..0000000 Binary files a/HMM/dataset.pyc and /dev/null differ diff --git a/HMM/hmm.py b/HMM/hmm.py index b3a92db..e682a05 100755 --- a/HMM/hmm.py +++ b/HMM/hmm.py @@ -1,267 +1,318 @@ -#!/usr/bin/env python - -from util import * -from numpy import * +#!/usr/bin/env python + +from util import (custom_flatten, + print_timing, + random_from_dist, + array_to_string) +from numpy import ( + shape, + random, + exp, + zeros, + array, + ones) from math import log import copy import sys -# If PRODUCTION is false, don't do smoothing +# If PRODUCTION is false, don't do smoothing PRODUCTION = True # Pretty printing for 1D/2D numpy arrays MAX_PRINTING_SIZE = 30 + def format_array(arr): s = shape(arr) if s[0] > MAX_PRINTING_SIZE or (len(s) == 2 and s[1] > MAX_PRINTING_SIZE): return "[ too many values (%s) ]" % s if len(s) == 1: - return "[ " + ( + return "[ " + ( " ".join(["%.6f" % float(arr[i]) for i in range(s[0])])) + " ]" else: lines = [] for i in range(s[0]): - lines.append("[ " + " ".join(["%.6f" % float(arr[i,j]) for j in range(s[1])]) + " ]") + lines.append("[ " + " ".join(["%.6f" % float(arr[i, j]) + for j in range(s[1])]) + " ]") return "\n".join(lines) - def format_array_print(arr): print format_array(arr) + def init_random_model(N, max_obs, seed=None): - if seed==None: + if seed is None: random.seed() else: random.seed(seed) # Initialize things to random values - tran_model = random.random([N,N]) - obs_model = random.random([N,max_obs]) - initial = random.random([N]) + tran_model = random.random([N, N]) + obs_model = random.random([N, max_obs]) + initial = random.random([N]) - initial = ones([N]) + initial = ones([N]) # Normalize - initial = initial/sum(initial) - for i in range(N): - tran_model[i,:] = tran_model[i,:]/sum(tran_model[i,:]) - obs_model[i,:] = obs_model[i,:]/sum(obs_model[i,:]) - - return (initial, tran_model, obs_model) + initial = initial / sum(initial) + for i in range(N): + tran_model[i, :] = tran_model[i, :]/sum(tran_model[i, :]) + obs_model[i, :] = obs_model[i, :]/sum(obs_model[i, :]) + return 
(initial, tran_model, obs_model) def string_of_model(model, label): (initial, tran_model, obs_model) = model return """ -Model: %s -initial: +Model: %s +initial: %s -transition: +transition: %s -observation: +observation: %s -""" % (label, +""" % (label, format_array(initial), format_array(tran_model), format_array(obs_model)) - + def check_model(model): """Check that things add to one as they should""" (initial, tran_model, obs_model) = model for state in range(len(initial)): - assert((abs(sum(tran_model[state,:]) - 1)) <= 0.01) - assert((abs(sum(obs_model[state,:]) - 1)) <= 0.01) + assert((abs(sum(tran_model[state, :]) - 1)) <= 0.01) + assert((abs(sum(obs_model[state, :]) - 1)) <= 0.01) assert((abs(sum(initial) - 1)) <= 0.01) def print_model(model, label): check_model(model) - print string_of_model(model, label) + print string_of_model(model, label) + def max_delta(model, new_model): - """Return the largest difference between any two corresponding + """Return the largest difference between any two corresponding values in the models""" - return max( [(abs(model[i] - new_model[i])).max() for i in range(len(model))] ) + return max([( + abs( + model[i] - new_model[i]) + ).max() for i in range(len(model))]) def get_alpha(obs, model): - """ Returns the array of alphas and the log likelyhood of the sequence. - - Note: doing normalization as described in Ghahramani '01--just normalizing - both alpha and beta to sum to 1 at each time step.""" - - (initial, tran_model, obs_model) = model - N = shape(tran_model)[0] - n = len(obs) - loglikelyhood = 0 - - alpha = zeros((n,N)) - alpha[0,:] = initial * obs_model[:,obs[0]] - normalization = sum(alpha[0,:]) - alpha[0,:] /= normalization - loglikelyhood += log(normalization) - - for t in range(1,n): - for j in range(N): - s = sum(tran_model[:,j]*alpha[t-1,:]) - alpha[t,j] = s * obs_model[j,obs[t]] - normalization = sum(alpha[t,:]) - loglikelyhood += log(normalization) - alpha[t,:] /= normalization - - return alpha, loglikelyhood - - -def get_beta(obs,model): - """ Note: doing normalization as described in Ghahramani '01--just normalizing - both alpha and beta to sum to 1 at each time step.""" - - (initial, tran_model, obs_model) = model - N = shape(tran_model)[0] - n = len(obs) - # beta[time,state] - beta = zeros((n,N)) - beta[n-1,:] = ones(N) / N - for t in range(n-2,-1,-1): - for i in range(N): - beta[t,i] = sum(tran_model[i,:]*obs_model[:,obs[t+1]]*beta[t+1,:]) - normalization = sum(beta[t,:]) - beta[t,:] /= normalization - return beta + """ Returns the array of alphas and the log likelyhood of the sequence. 
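# A quick numeric cross-check (not part of the patch) of what get_alpha
# returns: with per-step normalization, the log-likelihood is the sum of
# the log normalizers, so math.exp(loglik) should equal the brute-force
# sum over all hidden paths for a short sequence.
from itertools import product

def brute_force_likelihood(obs, model):
    (initial, tran_model, obs_model) = model
    N = len(initial)
    total = 0.0
    for path in product(range(N), repeat=len(obs)):
        p = initial[path[0]] * obs_model[path[0], obs[0]]
        for t in range(1, len(obs)):
            p *= tran_model[path[t - 1], path[t]] * obs_model[path[t], obs[t]]
        total += p
    return total

# e.g. with the toy model defined later in this file:
#   alpha, ll = get_alpha([0, 1, 1], get_toy_model().get_model())
#   abs(math.exp(ll) - brute_force_likelihood([0, 1, 1], get_toy_model().get_model())) < 1e-9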
+ + Note: doing normalization as described in Ghahramani '01--just normalizing + both alpha and beta to sum to 1 at each time step.""" + + (initial, tran_model, obs_model) = model + N = shape(tran_model)[0] + n = len(obs) + loglikelyhood = 0 + alpha = zeros((n, N)) + alpha[0, :] = initial * obs_model[:, obs[0]] + normalization = sum(alpha[0, :]) + alpha[0, :] /= normalization + loglikelyhood += log(normalization) + + for t in range(1, n): + for j in range(N): + s = sum(tran_model[:, j]*alpha[t-1, :]) + alpha[t, j] = s * obs_model[j, obs[t]] + normalization = sum(alpha[t, :]) + loglikelyhood += log(normalization) + alpha[t, :] /= normalization + + return alpha, loglikelyhood + + +def get_beta(obs, model): + """ Note: doing normalization as described + in Ghahramani '01--just normalizing + both alpha and beta to sum to 1 at each time step.""" + + (initial, tran_model, obs_model) = model + N = shape(tran_model)[0] + n = len(obs) + # beta[time,state] + beta = zeros((n, N)) + beta[n-1, :] = ones(N) / N + for t in range(n - 2, -1, -1): + for i in range(N): + beta[t, i] = sum( + tran_model[i, :] + * + obs_model[:, obs[t+1]] + * + beta[t+1, :]) + normalization = sum(beta[t, :]) + beta[t, :] /= normalization + return beta def get_gamma(alpha, beta): - (n,N) = shape(alpha) - gamma = zeros((n,N)) - for t in range(n): - normalization = sum(alpha[t,:]*beta[t,:]) - gamma[t,:] = alpha[t,:] * beta[t,:] / normalization - return gamma - - -def get_xi(obs,alpha, beta, model): - (initial, tran_model, obs_model) = model - N = shape(tran_model)[0] - n = len(obs) - xi = zeros((n, N, N)) - for t in range(n-1): - s = 0 - for i in range(N): - for j in range(N): - xi[t,i,j] = alpha[t,i] * tran_model[i,j] * obs_model[j,obs[t+1]] * beta[t+1,j] - s += xi[t,i,j] - xi[t,:,:] = xi[t,:,:] / s - return xi - - -def compute_expectation_step(obs, N, N_ho, N_h1h2, N_h1, N_h, model, debug=False): - """ E-step, update the sufficient statistics given the current model, - and return the loglikelihood of the dataset under the current model - - obs: the observation sequences in the training data - - N: number of hidden states - - the sufficient statistics, refer to lecture 15 notes, p13 - all are stored in numpy arrays - N_ho: expected number of times in the training data that - an observation is the output in hidden state. 
- It is a numpy array with the number of rows - equal to the number of hidden states (N) - and the number of cols equal to the number of observations (M) - N_h1h2: expected number of times a transition from one hidden state to another - N_h1: expected number of times in each initial state - N_h: expected of times in each state at all (used for obs model) - - model: the current hmm model of initial, transition and observation probs - debug: for printing out model parameters or not, set to True by -v option in command line - - Return dataset_logliklihood - note get_alpha() returns the likelihood of an observation seq and - note that functions for getting beta, xi and gamma values are also implemented for you""" - - datasetLoglikelihood = 0.0 - for obser in obs: - (alpha , loglikelihood) = get_alpha(obser , model) - beta = get_beta(obser , model) - gamma = get_gamma(alpha , beta) - xi = get_xi(obser , alpha , beta , model) - - datasetLoglikelihood += loglikelihood - - V = shape(N_ho)[1] - n = len(obser) - - for i in range(N): - for j in range(V): - N_ho[i , j] += sum([gamma[t , i] for t in range(n) if obser[t] == j]) - - for i in range(N): - for j in range(N): - N_h1h2[i , j] += sum(xi[:, i , j]) - - N_h1 += gamma[0 , :] - N_h += sum(gamma , 0) - - return datasetLoglikelihood - - -def compute_maximization_step(N, M, N_ho, N_h1h2, N_h1, N_h, model, debug=False): - """M-step, update the hmm model by using the incoming sufficient statistics, and return an updated model - model = (initial, tran_model, obs_model) - - N: number of hidden states - M: number of possible observations - - the sufficient statistics, refer to lecture 15 notes, p13, - all are stored in numpy arrays - N_ho: expected number of times in the training data that - an observation is the output in hidden state. 
- It is a numpy array with the number of rows - equal to the number of hidden states (N) - and the number of cols equal to the number of observations (M) - N_h1h2: expected number of times a transition from one hidden state to another - N_h1: expected number of times in each initial state - N_h: expected of times in each state at all (used for obs model) - - model: the current hmm model of initial, transition and observation probs - debug: for printing out model parameters or not, set to True by -v option in command line - - Return model, an updated hmm model of initial, transition and observation probs - """ - - (initial , tran_model , obs_model) = model - - initial = N_h1 / sum(N_h1) - for i in range(N): - tran_model[i , :] = N_h1h2[i , :] / sum(N_h1h2[i , :]) - - for i in range(N): - obs_model[i , :] = N_ho[i , :] / sum(N_ho[i , :]) - - return (initial , tran_model , obs_model) + (n, N) = shape(alpha) + gamma = zeros((n, N)) + for t in range(n): + normalization = sum( + alpha[t, :] + * + beta[t, :]) + gamma[t, :] = ( + alpha[t, :] + * + beta[t, :] + / + normalization) + return gamma + + +def get_xi(obs, alpha, beta, model): + (initial, tran_model, obs_model) = model + N = shape(tran_model)[0] + n = len(obs) + xi = zeros((n, N, N)) + for t in range(n - 1): + s = 0 + for i in range(N): + for j in range(N): + xi[t, i, j] = alpha[t, i] * tran_model[i, j] * \ + obs_model[j, obs[t + 1]] * beta[t + 1, j] + s += xi[t, i, j] + xi[t, :, :] = xi[t, :, :] / s + return xi + + +def compute_expectation_step(obs, N, + N_ho, + N_h1h2, N_h1, N_h, model, debug=False): + """ E-step, update the sufficient statistics given the current model, + and return the loglikelihood of the dataset under the current model + + obs: the observation sequences in the training data + + N: number of hidden states + + the sufficient statistics, refer to lecture 15 notes, p13 + all are stored in numpy arrays + N_ho: expected number of times in the training data that + an observation is the output in hidden state. 
+ It is a numpy array with the number of rows + equal to the number of hidden states (N) + and the number of cols equal to the number of observations (M) + N_h1h2: expected number of times a transition from one + hidden state to another + N_h1: expected number of times in each initial state + N_h: expected of times in each state at all (used for obs model) + + model: the current hmm model of initial, transition and observation probs + debug: for printing out model parameters or not, set to True by -v + option in command line + + Return dataset_logliklihood + note get_alpha() returns the likelihood of an observation seq and + note that functions for getting beta, xi and gamma values are + also implemented for you""" + + datasetLoglikelihood = 0.0 + for obser in obs: + (alpha, loglikelihood) = get_alpha(obser, model) + beta = get_beta(obser, model) + gamma = get_gamma(alpha, beta) + xi = get_xi(obser, alpha, beta, model) + + datasetLoglikelihood += loglikelihood + + V = shape(N_ho)[1] + n = len(obser) + + for i in range(N): + for j in range(V): + N_ho[i, j] += sum([gamma[t, i] + for t in range(n) if obser[t] == j]) + + for i in range(N): + for j in range(N): + N_h1h2[i, j] += sum(xi[:, i, j]) + + N_h1 += gamma[0, :] + N_h += sum(gamma, 0) + + return datasetLoglikelihood + + +def compute_maximization_step(N, + M, + N_ho, + N_h1h2, + N_h1, + N_h, + model, + debug=False + ): + """M-step, update the hmm model by using the incoming + sufficient statistics, + and return an updated model + model = (initial, tran_model, obs_model) + + N: number of hidden states + M: number of possible observations + + the sufficient statistics, refer to lecture 15 notes, p13, + all are stored in numpy arrays + N_ho: expected number of times in the training data that + an observation is the output in hidden state. + It is a numpy array with the number of rows + equal to the number of hidden states (N) + and the number of cols equal to the number of observations (M) + N_h1h2: expected number of times a transition from one +hidden state to another + N_h1: expected number of times in each initial state + N_h: expected of times in each state at all (used for obs model) + + model: the current hmm model of initial, transition and observation probs + debug: for printing out model parameters or not, set to True + by -v option in command line + + Return model, an updated hmm model of initial, + transition and observation probs + """ + + (initial, tran_model, obs_model) = model + + initial = N_h1 / sum(N_h1) + for i in range(N): + tran_model[i, :] = N_h1h2[i, + :] / sum( + N_h1h2[i, + :]) + + for i in range(N): + obs_model[i, :] = N_ho[i, + :] / sum(N_ho[i, + :]) + return (initial, tran_model, obs_model) # Note: This implementation is as presented in the Rabiner '89 HMM tutorial. # Variable definitions # obs = list of numpy arrays representing multiple observation sequences # K = the number of observation sequences -# N = num hidden states +# N = num hidden states # M = number of possible observations (assuming 0-indexed) # num_iters = maximum number of iterations allowed (if set to 0 then no limit) # For each observation sequence: # n = number of observations in the sequence. 
(indexed 0..n-1) -def baumwelch(obs,N,M, num_iters=0, debug=True,init_model=None, flag=False): +def baumwelch(obs, N, M, num_iters=0, debug=True, init_model=None, flag=False): K = len(obs) if debug: @@ -271,25 +322,25 @@ def baumwelch(obs,N,M, num_iters=0, debug=True,init_model=None, flag=False): if debug: print "smoothing", PRODUCTION - if init_model == None: + if init_model is None: if debug: seed = 42 else: # Just making things deterministic for now. # Change to "seed = None" if you want to experiment with # random restart, for example. - seed = 42 - model = init_random_model(N,M, seed) + seed = 42 + model = init_random_model(N, M, seed) else: model = init_model if debug: print_model(model, "Initial model") - + # Loop variables iters = 1 # Keep track of the likelihood of the observation sequences - loglikelihoods = [] + loglikelihoods = [] while True: if debug: print "\n\n======= Starting iteration %d ========" % iters @@ -299,33 +350,34 @@ def baumwelch(obs,N,M, num_iters=0, debug=True,init_model=None, flag=False): if smoothing: # Using prior that we've been in every state once, and seen # uniform everything. - N_ho = ones((N,M)) / M - N_h1h2 = ones((N,N)) / N + N_ho = ones((N, M)) / M + N_h1h2 = ones((N, N)) / N # Number of times in each initial state (for init model) N_h1 = ones(N) / N - + # Number of times in each state at all (for obs model) N_h = ones(N) else: - N_ho = zeros((N,M)) - N_h1h2 = zeros((N,N)) + N_ho = zeros((N, M)) + N_h1h2 = zeros((N, N)) # Number of times in each initial state (for init model) N_h1 = zeros(N) - + # Number of times in each state at all (for obs model) N_h = zeros(N) - old_model = copy.deepcopy(model) - + #### Expectation step #### - #N_ho, N_h1h2, N_h1, N_h are numpy arrays and are passed by reference, updated through "side-effects" - dataset_loglikelihood = compute_expectation_step(obs, N, N_ho, N_h1h2, N_h1, N_h, model, debug) + # N_ho, N_h1h2, N_h1, N_h are numpy arrays and are passed by reference, + # updated through "side-effects" + dataset_loglikelihood = compute_expectation_step( + obs, N, N_ho, N_h1h2, N_h1, N_h, model, debug) loglikelihoods.append(dataset_loglikelihood) ### Maximization step ### - model = compute_maximization_step(N, M, N_ho, N_h1h2, N_h1, N_h, model, debug) - + model = compute_maximization_step( + N, M, N_ho, N_h1h2, N_h1, N_h, model, debug) # Termination if debug: @@ -333,7 +385,7 @@ def baumwelch(obs,N,M, num_iters=0, debug=True,init_model=None, flag=False): delta = max_delta(model, old_model) if debug: print "Iters = %d, delta = %f, Log prob of sequences: %f" % ( - iters, delta, loglikelihoods[-1]) + iters, delta, loglikelihoods[-1]) sys.stdout.flush() iters += 1 @@ -342,28 +394,34 @@ def baumwelch(obs,N,M, num_iters=0, debug=True,init_model=None, flag=False): if len(loglikelihoods) > 1: cur = loglikelihoods[-1] prev = loglikelihoods[-2] - - improvement = (cur-prev) / abs(prev) - # Two ways to stop: + improvement = (cur - prev) / abs(prev) + + # Two ways to stop: # (1) the probs stop changing epsilon = 0.001 if delta < epsilon: if debug: print "Converged to within %f!\n\n" % epsilon break - + # (2) the improvement in log likelyhood is too small to bother smaller = 0.0002 if improvement < smaller: if debug: - print "Converged. Log likelyhood improvement was less that %f.\n\n" % smaller + print ( + "Converged. 
" + "Log likelyhood improvement was less that %f.\n\n" % + smaller) break - + if num_iters: - if iters-1 == num_iters: + if iters - 1 == num_iters: if debug: - print "Maximum number of iterations (%d iterations) reached.\n\n" % (iters-1) + print ( + "Maximum number of iterations" + " (%d iterations) reached.\n\n" % ( + iters - 1)) break (initial, tran_model, obs_model) = model @@ -373,12 +431,10 @@ def baumwelch(obs,N,M, num_iters=0, debug=True,init_model=None, flag=False): return tran_model, obs_model, initial, loglikelihoods - - - - class HMM: + """ HMM Class that defines the parameters for HMM """ + def __init__(self, states, outputs): """If the hmm is going to be trained from data with labeled states, states should be a list of the state names. If the HMM is @@ -390,7 +446,7 @@ def __init__(self, states, outputs): self.num_states = n_s self.num_outputs = n_o self.initial = zeros(n_s) - self.transition = zeros([n_s,n_s]) + self.transition = zeros([n_s, n_s]) self.observation = zeros([n_s, n_o]) def set_hidden_model(self, init, trans, observ): @@ -401,7 +457,7 @@ def set_hidden_model(self, init, trans, observ): self.transition = array(trans) self.observation = array(observ) self.compute_logs() - + def get_model(self): return (self.initial, self.transition, self.observation) @@ -411,57 +467,80 @@ def compute_logs(self): self.log_initial = f(self.initial) self.log_transition = map(f, self.transition) self.log_observation = map(f, self.observation) - def __repr__(self): - return """states = %s + return ( + """states = %s observations = %s %s -""" % (" ".join(array_to_string(self.states)), - " ".join(array_to_string(self.outputs)), - string_of_model((self.initial, self.transition, self.observation), "")) +""" % + ( + " ".join( + array_to_string( + self.states + ) + ), + " ".join( + array_to_string( + self.outputs + ) + ), + string_of_model( + ( + self.initial, + self.transition, + self.observation + ), + "") + ) + ) - # declare the @ decorator just before the function, invokes print_timing() @print_timing def learn_from_labeled_data(self, state_seqs, obs_seqs): - """ - Learn the parameters given state and observations sequences. - Tje ordering of states in states[i][j] must correspond with observations[i][j]. - Uses Laplacian smoothing to avoid zero probabilities. - """ + """ + Learn the parameters given state and observations sequences. + The ordering of states in states[i][j] + must correspond with observations[i][j]. + Uses Laplacian smoothing to avoid zero probabilities. + """ - # Fill this in... + # Fill this in... # self.initial = normalize(...) # self.transition = ... # self.observation = ... 
# self.compute_logs() - - prefix = zeros(self.num_states) - for state in state_seqs: - self.initial[state[0]] += 1 - for i in range(len(state) - 1): - self.transition[state[i]][state[i+1]] += 1 - prefix[state[i]] += 1 - #prefix[state[-1]] += 1 - - for i in range(self.num_states): - self.initial[i] = (self.initial[i] + 1.0) / (len(state_seqs) + self.num_states) - for j in range(self.num_states): - self.transition[i][j] = (self.transition[i][j] + 1.0) / (prefix[i] + self.num_states) - - prefix = zeros(self.num_states) - for i in range(len(state_seqs)): - for j in range(len(state_seqs[i])) : - self.observation[state_seqs[i][j]][obs_seqs[i][j]] += 1 - prefix[state_seqs[i][j]] += 1 - - for i in range(self.num_states): - for j in range(self.num_outputs): - self.observation[i][j] = (self.observation[i][j] + 1.0) / (prefix[i] + self.num_outputs) - - self.compute_logs() - + + prefix = zeros(self.num_states) + for state in state_seqs: + self.initial[state[0]] += 1 + for i in range(len(state) - 1): + self.transition[state[i]][state[i + 1]] += 1 + prefix[state[i]] += 1 + #prefix[state[-1]] += 1 + + for i in range(self.num_states): + self.initial[i] = (self.initial[i] + 1.0) / \ + (len(state_seqs) + self.num_states) + for j in range(self.num_states): + self.transition[i][j] = ( + self.transition[i][j] + 1.0) / ( + prefix[i] + self.num_states) + + prefix = zeros(self.num_states) + for i in range(len(state_seqs)): + for j in range(len(state_seqs[i])): + self.observation[state_seqs[i][j]][obs_seqs[i][j]] += 1 + prefix[state_seqs[i][j]] += 1 + + for i in range(self.num_states): + for j in range(self.num_outputs): + self.observation[i][j] = ( + self.observation[i][j] + 1.0) / ( + prefix[i] + self.num_outputs) + + self.compute_logs() + # declare the @ decorator just before the function, invokes print_timing() @print_timing def learn_from_observations(self, instances, debug=False, flag=False): @@ -470,36 +549,36 @@ def learn_from_observations(self, instances, debug=False, flag=False): This would find the maximum likelyhood transition model, observation model, and initial probabilities. """ - #def baumwelch(obs,N,M, num_iters=0, debug=False,init_model=None, flag=False): + # def baumwelch(obs,N,M, num_iters=0, debug=False,init_model=None, + # flag=False): loglikelihoods = None if not flag: - (self.transition, + (self.transition, self.observation, self.initial) = baumwelch(instances, - len(self.states), - len(self.outputs), + len(self.states), + len(self.outputs), 0, debug) else: - (self.transition, + (self.transition, self.observation, self.initial, loglikelihoods) = baumwelch(instances, - len(self.states), - len(self.outputs), - 0, - debug, None, flag) - - + len(self.states), + len(self.outputs), + 0, + debug, None, flag) + self.compute_logs() if flag: - return loglikelihoods + return loglikelihoods # Return the log probability that this hmm assigns to a particular output # sequence def log_prob_of_sequence(self, sequence): - model = (self.initial, self.transition, self.observation) + model = (self.initial, self.transition, self.observation) alpha, loglikelyhood = get_alpha(sequence, model) return loglikelyhood @@ -510,12 +589,12 @@ def most_likely_states(self, sequence, debug=False): """ # Code modified from wikipedia # Change this to use logs - + cnt = 0 states = range(0, self.num_states) T = {} for state in states: - ## V.path V. prob. + # V.path V. prob. 
output = sequence[0] p = self.log_initial[state] + self.log_observation[state][output] T[state] = ([state], p) @@ -546,7 +625,7 @@ def most_likely_states(self, sequence, debug=False): argmax = [next_state, argmax] U[next_state] = (argmax, valmax) T = U - ## apply sum/max to the final states: + # apply sum/max to the final states: argmax = None valmax = None for state in states: @@ -560,51 +639,55 @@ def most_likely_states(self, sequence, debug=False): ans = custom_flatten(argmax) ans.reverse() return ans - + def gen_random_sequence(self, n): """ - Use the underlying model to generate a sequence of + Use the underlying model to generate a sequence of n (state, observation) pairs """ # pick a starting point - state = random_from_dist(self.initial); + state = random_from_dist(self.initial) obs = random_from_dist(self.observation[state]) - seq = [(state,obs)] - for i in range(1,n): + seq = [(state, obs)] + for i in range(1, n): state = random_from_dist(self.transition[state]) obs = random_from_dist(self.observation[state]) - seq.append( (state, obs) ) + seq.append((state, obs)) return seq - + def get_wikipedia_model(): # From the rainy/sunny example on wikipedia (viterbi page) - hmm = HMM(['Rainy','Sunny'], ['walk','shop','clean']) + hmm = HMM(['Rainy', 'Sunny'], ['walk', 'shop', 'clean']) init = [0.6, 0.4] - trans = [[0.7,0.3], [0.4,0.6]] + trans = [[0.7, 0.3], [0.4, 0.6]] - observ = [[0.1,0.4,0.5], [0.6,0.3,0.1]] + observ = [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]] hmm.set_hidden_model(init, trans, observ) return hmm + def get_toy_model(): - hmm = HMM(['h1','h2'], ['A','B']) + hmm = HMM(['h1', 'h2'], ['A', 'B']) init = [0.6, 0.4] - trans = [[0.7,0.3], [0.4,0.6]] - observ = [[0.1,0.9], [0.9,0.1]] + trans = [[0.7, 0.3], [0.4, 0.6]] + observ = [[0.1, 0.9], [0.9, 0.1]] hmm.set_hidden_model(init, trans, observ) return hmm - + def test(): - hmm = get_wikipedia_model() - print "HMM is:" - print hmm - - seq = [0,1,2] - logp = hmm.log_prob_of_sequence(seq) - p = exp(logp) - print "prob ([walk, shop, clean]): logp= %f p= %f" % (logp, p) - print "most likely states (walk, shop, clean) = %s" % hmm.most_likely_states(seq) + hmm = get_wikipedia_model() + print "HMM is:" + print hmm + + seq = [0, 1, 2] + logp = hmm.log_prob_of_sequence(seq) + p = exp(logp) + print "prob ([walk, shop, clean]): logp= %f p= %f" % (logp, p) + print ( + "most likely states (walk, shop, clean) = %s" % + hmm.most_likely_states(seq) + ) if __name__ == "__main__": test() diff --git a/HMM/hmm.pyc b/HMM/hmm.pyc deleted file mode 100644 index 7e9a7d7..0000000 Binary files a/HMM/hmm.pyc and /dev/null differ diff --git a/HMM/task_hmm.py b/HMM/task_hmm.py index 6c4105d..eb88ae3 100755 --- a/HMM/task_hmm.py +++ b/HMM/task_hmm.py @@ -4,180 +4,208 @@ task_hmm.py -- Visualizations for hmms. """ -from os import path -import random +#from os import path +#import random from tfutils import tftask -from hmm import * -from viterbi import * -from classify import * +from hmm import HMM +from viterbi import run_viterbi, train_hmm_from_data +from classify import ( + compute_classification_performance, + split_into_categories, + train_N_state_hmms_from_data, +) + from dataset import DataSet MAX_NUM_HIDDEN_STATES = 8 + class Robot(tftask.ChartTask): + def get_name(self): - return "Robot experiments -- what path did the robot take to see these sequences of colors?" - + return ("Robot experiments -- what path did the robot take to " + "see these sequences of colors?")
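# The recurrence implemented by most_likely_states above, in isolation
# (sketch, not part of the patch): V_t[j] = log b_j(o_t) + max_i (V_{t-1}[i]
# + log a_ij); working in log space is what keeps the very long sequence in
# test_hmm.py's Viterbi test from underflowing.
def viterbi_scores(obs, log_init, log_tran, log_obs):
    # Best log-score of any state path ending in each state; the real
    # implementation also keeps backpointers to recover the path itself.
    N = len(log_init)
    V = [log_init[s] + log_obs[s][obs[0]] for s in range(N)]
    for o in obs[1:]:
        V = [max(V[i] + log_tran[i][j] for i in range(N)) + log_obs[j][o]
             for j in range(N)]
    return V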
+ + def get_priority(self): return 1 + def get_description(self): - return ("Train an HMM for each robot condition, without and with momentum.") - + return ("Train an HMM for each robot condition," + " without and with momentum.") + def task(self): - data_filename = "robot_no_momentum.data" + data_filename = "robot_no_momentum.data" hmm, d = train_hmm_from_data(data_filename) err_full = run_viterbi(hmm, d) - + data_filename_m = "robot_with_momentum.data" hmm_m, d_m = train_hmm_from_data(data_filename_m) err_full_m = run_viterbi(hmm_m, d_m) - - listNames = ["Without momentum", "With momentum"] - listData = [1-err_full, 1-err_full_m] - chart = {"chart": {"defaultSeriesType": "column"}, + + listNames = ["Without momentum", "With momentum"] + listData = [1 - err_full, 1 - err_full_m] + chart = {"chart": {"defaultSeriesType": "column"}, "xAxis": {"categories": listNames}, "yAxis": {"title": {"text": "Fraction Correct"}}, - "title": {"text": "HMM performance on infering robot location."}, - "series": [ {"name": "Test set performance", - "data": listData} ] } + "title": {"text": "HMM performance on" + " inferring robot location."}, + "series": [{"name": "Test set performance", + "data": listData}]} return chart - class WeatherStates_boston_la(tftask.ChartTask): + def get_name(self): - return "1st Weather experiments -- which city has this weather, boston or LA?" - + return ("1st Weather experiments -- which city has" + " this weather, boston or LA?") + def get_priority(self): return 2 - + def get_description(self): - return ("Train HMMs with different number of hidden states, and see how well they can distinguish between the weather of different cities.") - + return ("Train HMMs with different number of hidden states," + " and see how well they can distinguish between" + " the weather of different cities.") + def task(self): - num_states = range(1, MAX_NUM_HIDDEN_STATES) - filename = "weather_bos_la.data" - - dataset_performance = [] - for N in num_states: - hmms, dataset = train_N_state_hmms_from_data(filename, N) - fraction_incorrect = compute_classification_performance(hmms, dataset) - dataset_performance.append(1-fraction_incorrect) - - chart = {"chart": {"defaultSeriesType": "line"}, - "xAxis": {"title": {"text": "number of hidden states"}, - "categories": num_states}, - "yAxis": {"title": {"text": "Fraction Correct"}, - "min":0.0, "max":1.0}, - "title": {"text": "HMM performance on classifying weather sequences by city"}, + num_states = range(1, MAX_NUM_HIDDEN_STATES) + filename = "weather_bos_la.data" + + dataset_performance = [] + for N in num_states: + hmms, dataset = train_N_state_hmms_from_data(filename, N) + fraction_incorrect = compute_classification_performance( + hmms, dataset) + dataset_performance.append(1 - fraction_incorrect) + + chart = {"chart": {"defaultSeriesType": "line"}, + "xAxis": {"title": {"text": "number of hidden states"}, + "categories": num_states}, + "yAxis": {"title": {"text": "Fraction Correct"}, + "min": 0.0, "max": 1.0}, + "title": {"text": "HMM performance " + "on classifying weather sequences by city"}, "series": [{"name": "Boston_LA", - "data": dataset_performance}]} - return chart - - + "data": dataset_performance}]} + return chart + + #listNames = ["Boston_LA", "Boston_Seattle", "Boston_Phoenix_Seattle_LA"] class WeatherStates_boston_seattle(tftask.ChartTask): + def get_name(self): - return "2nd Weather experiments -- which city has this weather, boston or seattle?" - + return ("2nd Weather experiments -- which city has" + " this weather, boston or seattle?")
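# What compute_classification_performance boils down to per test sequence
# (sketch, not part of the patch): score the sequence under each
# category's HMM and predict the category with the highest log-likelihood.
def classify_sequence(hmms, seq):
    # hmms: dict mapping category name -> trained HMM (as built above)
    return max(hmms.keys(),
               key=lambda cat: hmms[cat].log_prob_of_sequence(seq))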
+ + def get_priority(self): return 3 - + def get_description(self): - return ("Train HMMs with different number of hidden states, and see how well they can distinguish between the weather of different cities.") - + return ("Train HMMs with different number of hidden states, and " + "see how well they can distinguish between" + " the weather of different cities.") + def task(self): - num_states = range(1, MAX_NUM_HIDDEN_STATES) - filename = "weather_bos_sea.data" - - dataset_performance = [] - for N in num_states: - hmms, dataset = train_N_state_hmms_from_data(filename, N) - fraction_incorrect = compute_classification_performance(hmms, dataset) - dataset_performance.append(1-fraction_incorrect) - - chart = {"chart": {"defaultSeriesType": "line"}, - "xAxis": {"title": {"text": "number of hidden states"}, - "categories": num_states}, - "yAxis": {"title": {"text": "Fraction Correct"}, - "min":0.0, "max":1.0}, - "title": {"text": "HMM performance on classifying weather sequences by city"}, + num_states = range(1, MAX_NUM_HIDDEN_STATES) + filename = "weather_bos_sea.data" + + dataset_performance = [] + for N in num_states: + hmms, dataset = train_N_state_hmms_from_data(filename, N) + fraction_incorrect = compute_classification_performance( + hmms, dataset) + dataset_performance.append(1 - fraction_incorrect) + + chart = {"chart": {"defaultSeriesType": "line"}, + "xAxis": {"title": {"text": "number of hidden states"}, + "categories": num_states}, + "yAxis": {"title": {"text": "Fraction Correct"}, + "min": 0.0, "max": 1.0}, + "title": {"text": "HMM performance on classifying" + " weather sequences by city"}, "series": [{"name": "Boston_Seattle", - "data": dataset_performance}]} - return chart - - - + "data": dataset_performance}]} + return chart + + class WeatherStates_all(tftask.ChartTask): + def get_name(self): - return "3rd Weather experiments -- which city has this weather, boston, seattle, LA, phoenix?" - + return ("3rd Weather experiments -- which city has this weather," + " boston, seattle, LA, phoenix?")
+ def get_priority(self): return 4 - + def get_description(self): - return ("Train HMMs with different number of hidden states, and see how well they can distinguish between the weather of different cities.") - + return ("Train HMMs with different number of hidden states, and " + "see how well they can distinguish between the weather" + " of different cities.") + def task(self): - num_states = range(1, MAX_NUM_HIDDEN_STATES) - filename = "weather_all.data" - - dataset_performance = [] - for N in num_states: - hmms, dataset = train_N_state_hmms_from_data(filename, N) - fraction_incorrect = compute_classification_performance(hmms, dataset) - dataset_performance.append(1-fraction_incorrect) - - chart = {"chart": {"defaultSeriesType": "line"}, - "xAxis": {"title": {"text": "number of hidden states"}, - "categories": num_states}, - "yAxis": {"title": {"text": "Fraction Correct"}, - "min":0.0, "max":1.0}, - "title": {"text": "HMM performance on classifying weather sequences by city"}, + num_states = range(1, MAX_NUM_HIDDEN_STATES) + filename = "weather_all.data" + + dataset_performance = [] + for N in num_states: + hmms, dataset = train_N_state_hmms_from_data(filename, N) + fraction_incorrect = compute_classification_performance( + hmms, dataset) + dataset_performance.append(1 - fraction_incorrect) + + chart = {"chart": {"defaultSeriesType": "line"}, + "xAxis": {"title": {"text": "number of hidden states"}, + "categories": num_states}, + "yAxis": {"title": {"text": "Fraction Correct"}, + "min": 0.0, "max": 1.0}, + "title": {"text": "HMM performance on classifying" + " weather sequences by city"}, "series": [{"name": "Boston_Phoenix_Seattle_LA", - "data": dataset_performance}]} - return chart - - - - + "data": dataset_performance}]} + return chart + + class BostonLikelihood(tftask.ChartTask): + def get_name(self): return "4th Weather experiments -- how is the weather here in boston?" 
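# The BostonLikelihood task below plots the final *training*
# log-likelihood, which can only improve as hidden states are added; a
# held-out variant (sketch, not part of the patch -- the 3/4 split is an
# arbitrary choice) is the usual companion experiment for picking N.
def heldout_loglik(seqs, num_states, outputs):
    cut = len(seqs) * 3 // 4
    model = HMM(range(num_states), outputs)
    model.learn_from_observations(seqs[:cut])
    return sum(model.log_prob_of_sequence(s) for s in seqs[cut:])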
- + def get_priority(self): return 5 - + def get_description(self): - return ("Train HMMs with different number of hidden states to see how many hidden states does boston need to model its weather.") - + return ("Train HMMs with different number of hidden states to " + "see how many hidden states does boston" + " need to model its weather.") + def task(self): - num_states = range(1, MAX_NUM_HIDDEN_STATES) - - filename = "weather_bos_la.data" - dataset = DataSet(filename) - category_seqs = split_into_categories(dataset) - boston_seqs = category_seqs["boston"] - - likelihoods = [] - for N in num_states: - model = HMM(range(N), dataset.outputs) - ll = model.learn_from_observations(boston_seqs, False, True) - likelihoods.append(ll[-1]) - - chart = {"chart": {"defaultSeriesType": "line"}, - "xAxis": {"title": {"text": "number of hidden states"}, - "categories": num_states}, + num_states = range(1, MAX_NUM_HIDDEN_STATES) + + filename = "weather_bos_la.data" + dataset = DataSet(filename) + category_seqs = split_into_categories(dataset) + boston_seqs = category_seqs["boston"] + + likelihoods = [] + for N in num_states: + model = HMM(range(N), dataset.outputs) + ll = model.learn_from_observations(boston_seqs, False, True) + likelihoods.append(ll[-1]) + + chart = {"chart": {"defaultSeriesType": "line"}, + "xAxis": {"title": {"text": "number of hidden states"}, + "categories": num_states}, "yAxis": {"title": {"text": "Fraction Correct"}}, - "title": {"text": "log likelihood of HMMs modeling boston weather"}, + "title": {"text": "log likelihood of HMMs" + " modeling boston weather"}, "series": [{"name": "boston training data", - "data": likelihoods}]} - - return chart + "data": likelihoods}]} + + return chart - def main(argv): return tftask.main() @@ -185,5 +213,3 @@ def main(argv): if __name__ == "__main__": import sys sys.exit(main(sys.argv)) - - diff --git a/HMM/task_hmm.pyc b/HMM/task_hmm.pyc deleted file mode 100644 index 19b51d1..0000000 Binary files a/HMM/task_hmm.pyc and /dev/null differ diff --git a/HMM/test_hmm.py b/HMM/test_hmm.py index 694947d..5b47515 100755 --- a/HMM/test_hmm.py +++ b/HMM/test_hmm.py @@ -4,160 +4,168 @@ test_hmm.py -- unit tests for hmms implemented in hmm.py """ -from hmm import * -from viterbi import * -from dataset import DataSet -#from util import * -import functools -import math +from hmm import ( + shape, + format_array_print, + get_alpha, + get_gamma, + HMM, max_delta, + array, ones, get_beta, baumwelch) +from viterbi import run_viterbi, train_hmm_from_data +from util import normalize_filename import unittest + class HMMsTest(unittest.TestCase): # test for learn_from_labeled_data() + def test_simple_hmm_learning(self): - state_seq = [[0,1,1,0,1,0,1,1], [0,0,1,0]] - obs_seq = [[0,0,1,1,0,0,0,1], [0,1,0,0]] - hmm = HMM(range(2), range(2)) - hmm.learn_from_labeled_data(state_seq, obs_seq) - print hmm - eps = 0.00001 - self.assertTrue(max_delta(hmm.initial, [0.750000,0.250000]) < eps) - self.assertTrue(max_delta(hmm.transition, - [[0.285714, 0.714286], - [0.571429, 0.428571]]) < eps) - self.assertTrue(max_delta(hmm.observation, - [[0.625000, 0.375000], - [0.625000, 0.375000]]) < eps) - - + state_seq = [[0, 1, 1, 0, 1, 0, 1, 1], [0, 0, 1, 0]] + obs_seq = [[0, 0, 1, 1, 0, 0, 0, 1], [0, 1, 0, 0]] + hmm = HMM(range(2), range(2)) + hmm.learn_from_labeled_data(state_seq, obs_seq) + print hmm + eps = 0.00001 + self.assertTrue(max_delta(hmm.initial, [0.750000, 0.250000]) < eps) + self.assertTrue(max_delta(hmm.transition, + [[0.285714, 0.714286], + [0.571429, 0.428571]]) < eps)
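# Where the expected constants in test_simple_hmm_learning come from
# (worked example, not part of the patch). With Laplace smoothing,
# initial[i] = (count_i + 1) / (num_seqs + num_states): both training
# sequences start in state 0, so initial = [3/4, 1/4] = [0.75, 0.25].
# State 0 is left 5 times (once to state 0, four times to state 1), so
# transition[0] = [(1+1)/(5+2), (4+1)/(5+2)] = [0.285714..., 0.714286...],
# and state 0 emits 6 outputs (four 0s, two 1s), so
# observation[0] = [(4+1)/(6+2), (2+1)/(6+2)] = [0.625, 0.375].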
+ self.assertTrue(max_delta(hmm.observation, + [[0.625000, 0.375000], + [0.625000, 0.375000]]) < eps) + + def simple_weather_model(): - hmm = HMM(['s1','s2'], ['R','NR']) + hmm = HMM(['s1', 's2'], ['R', 'NR']) init = [0.7, 0.3] - trans = [[0.8,0.2], - [0.1,0.9]] - observ = [[0.75,0.25], - [0.4,0.6]] + trans = [[0.8, 0.2], + [0.1, 0.9]] + observ = [[0.75, 0.25], + [0.4, 0.6]] hmm.set_hidden_model(init, trans, observ) return hmm + class ViterbiTest(unittest.TestCase): + def toy_model(self): - hmm = HMM(['s1','s2'], ['R','NR']) - init = [0.5, 0.5] - trans = [[0.2,0.8], - [0.8,0.2]] - observ = [[0.8,0.2], - [0.2,0.8]] - hmm.set_hidden_model(init, trans, observ) - return hmm - - + hmm = HMM(['s1', 's2'], ['R', 'NR']) + init = [0.5, 0.5] + trans = [[0.2, 0.8], + [0.8, 0.2]] + observ = [[0.8, 0.2], + [0.2, 0.8]] + hmm.set_hidden_model(init, trans, observ) + return hmm + def test_viterbi_simple_sequence(self): - hmm = simple_weather_model() - print "*******************************************" - print hmm - print "******************************************" - seq = [1, 1, 0] # NR, NR, R - hidden_seq = hmm.most_likely_states(seq) - print "most likely states for [NR, NR, R] = %s" % hidden_seq - self.assertEqual(hidden_seq, [1,1,1]) + hmm = simple_weather_model() + print "*******************************************" + print hmm + print "******************************************" + seq = [1, 1, 0] # NR, NR, R + hidden_seq = hmm.most_likely_states(seq) + print "most likely states for [NR, NR, R] = %s" % hidden_seq + self.assertEqual(hidden_seq, [1, 1, 1]) def test_viterbi_long_sequence(self): - hmm = self.toy_model() - N = 10 - seq = [1,0,1,0,1,0,1,1,0] * 400 - hidden_seq = hmm.most_likely_states(seq, False) - # Check if we got right answer from the version with logs. - self.assertEqual(hidden_seq[2000:2010], [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]) + hmm = self.toy_model() + #N = 10 + seq = [1, 0, 1, 0, 1, 0, 1, 1, 0] * 400 + hidden_seq = hmm.most_likely_states(seq, False) + # Check if we got right answer from the version with logs. 
+ self.assertEqual(hidden_seq[2000:2010], [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]) class RobotTest(unittest.TestCase): + def test_small_robot_dataset(self): - data_filename = "robot_small.data" - data_filename = normalize_filename(data_filename) - hmm, d = train_hmm_from_data(data_filename) - err_full = run_viterbi(hmm, d, True) - self.assertAlmostEqual(err_full, 2.0/9) - + data_filename = "robot_small.data" + data_filename = normalize_filename(data_filename) + hmm, d = train_hmm_from_data(data_filename) + err_full = run_viterbi(hmm, d, True) + self.assertAlmostEqual(err_full, 2.0 / 9) + + class BaumWelchTest(unittest.TestCase): + def setUp(self): - # Initialize things to specific values, for testing - N = 3 # num hidden states - # Normalized below - transition = array([[1.0,1.0,1.0], - [1.0,1.0,1.0], - [1.0,1.0,1.0]]) - observation = array([[1.0,1.0], [3.0,1.0], [1.0,3.0]]) - initial = ones([N]) - - # Normalize - initial = initial/sum(initial) - for i in range(N): - transition[i,:] = transition[i,:]/sum(transition[i,:]) - observation[i,:] = observation[i,:]/sum(observation[i,:]) - self.model = (initial, transition, observation) - self.seq = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1] - - + # Initialize things to specific values, for testing + N = 3 # num hidden states + # Normalized below + transition = array([[1.0, 1.0, 1.0], + [1.0, 1.0, 1.0], + [1.0, 1.0, 1.0]]) + observation = array([[1.0, 1.0], [3.0, 1.0], [1.0, 3.0]]) + initial = ones([N]) + + # Normalize + initial = initial / sum(initial) + for i in range(N): + transition[i, :] = transition[i, :]/sum(transition[i, :]) + observation[i, :] = observation[i, :]/sum(observation[i, :]) + self.model = (initial, transition, observation) + self.seq = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1] + def test_bw_beta_all_equal(self): - # check that all betas are the same - # get_beta() function already implemented in hmm.py as part of the support code - - beta = get_beta(self.seq, self.model) - print "beta: " - format_array_print (beta) - num_rows = shape(beta)[0] - num_cols = shape(beta)[1] - for r in range(num_rows): - for c in range(num_cols): - self.assertAlmostEqual(beta[r,c], 1.0/3) - + # check that all betas are the same + # get_beta() function already implemented in hmm.py as part of the + # support code + + beta = get_beta(self.seq, self.model) + print "beta: " + format_array_print(beta) + num_rows = shape(beta)[0] + num_cols = shape(beta)[1] + for r in range(num_rows): + for c in range(num_cols): + self.assertAlmostEqual(beta[r, c], 1.0 / 3) + def test_bw_gamma_first_col_equal(self): - # check that the first column of gamma are the same - # get_gamma(), get_beta(), get_alpha() functions already implemented in hmm.py as part of the support code - alpha, logp = get_alpha(self.seq, self.model) - beta = get_beta(self.seq, self.model) - gamma = get_gamma(alpha, beta) - print "gamma: " - format_array_print (gamma) - gamma_first_value = gamma[0,0] - num_rows = shape(gamma)[0] - for r in range(1, num_rows): - self.assertAlmostEqual(gamma_first_value, gamma[r, 0]) - - # Run EM on this one sequence, with the initial model above. 
- - #model = (initial, transition, observation) - #baumwelch(seq, 3, 2, 1, True, model) - + # check that the first column of gamma are the same + # get_gamma(), get_beta(), get_alpha() functions already implemented in + # hmm.py as part of the support code + alpha, logp = get_alpha(self.seq, self.model) + beta = get_beta(self.seq, self.model) + gamma = get_gamma(alpha, beta) + print "gamma: " + format_array_print(gamma) + gamma_first_value = gamma[0, 0] + num_rows = shape(gamma)[0] + for r in range(1, num_rows): + self.assertAlmostEqual(gamma_first_value, gamma[r, 0]) + + # Run EM on this one sequence, with the initial model above. + + #model = (initial, transition, observation) + #baumwelch(seq, 3, 2, 1, True, model) + + class BaumWelchWeatherTest(unittest.TestCase): + def setUp(self): - weather_hmm = simple_weather_model() - self.seqs = [[0, 0], [1, 1, 0]] - self.init_model = weather_hmm.get_model() - import hmm - # turn off smoothing just for unit test, to match numbers from lecture notes - hmm.PRODUCTION = False - - + weather_hmm = simple_weather_model() + self.seqs = [[0, 0], [1, 1, 0]] + self.init_model = weather_hmm.get_model() + import hmm + # turn off smoothing just for unit test, to match numbers from lecture + # notes + hmm.PRODUCTION = False + def test_bw_simple_weather_model(self): - # example from lecture notes 15, p14, just runs one iteration here - model = baumwelch(self.seqs, 2, 2, 1, True, self.init_model) # just one EM iteration - (transition, observation, initial) = model - - eps = 0.0001 - self.assertTrue(max_delta(initial, [ 0.646592, 0.353408 ]) < eps) - - self.assertTrue(max_delta(transition, [[ 0.841285, 0.158715 ], - [ 0.127844, 0.872156 ]]) < eps) - self.assertTrue(max_delta(observation, [[ 0.731416, 0.268584 ], - [ 0.426629, 0.573371 ]]) < eps) - - - - - - + # example from lecture notes 15, p14, just runs one iteration here + # just one EM iteration + model = baumwelch(self.seqs, 2, 2, 1, True, self.init_model) + (transition, observation, initial) = model + + eps = 0.0001 + self.assertTrue(max_delta(initial, [0.646592, 0.353408]) < eps) + + self.assertTrue(max_delta(transition, [[0.841285, 0.158715], + [0.127844, 0.872156]]) < eps) + self.assertTrue(max_delta(observation, [[0.731416, 0.268584], + [0.426629, 0.573371]]) < eps) + + if __name__ == '__main__': unittest.main() - diff --git a/HMM/test_hmm.pyc b/HMM/test_hmm.pyc deleted file mode 100644 index fab7eba..0000000 Binary files a/HMM/test_hmm.pyc and /dev/null differ diff --git a/HMM/util.py b/HMM/util.py index f63b9d1..c894fff 100644 --- a/HMM/util.py +++ b/HMM/util.py @@ -4,21 +4,25 @@ import random from os import path + def normalize_filename(filename): return path.join(path.dirname(__file__), filename) + def print_timing(func): def wrapper(*arg): t1 = time.time() res = func(*arg) t2 = time.time() - print '%s took %0.3f ms' % (func.func_name, (t2-t1)*1000.0) + print '%s took %0.3f ms' % (func.func_name, (t2 - t1) * 1000.0) return res return wrapper + def array_to_string(a): return [str(x) for x in a] + def normalize(a): """Normalize the 1d array a. 
Must have non-zero sum""" return a / sum(a) @@ -37,6 +41,7 @@ def random_from_dist(ps): return i raise Exception("random_from_dist shouldn't run off the end of the array") + def custom_flatten(xs): """flatten a list that looks like [a,[b,[c,[d,[e]]]]] needed because the list can be hundreds of thousands of elements long, @@ -49,6 +54,7 @@ def custom_flatten(xs): result.append(xs[0]) return result + def flatten(x): """flatten(sequence) -> list @@ -69,4 +75,3 @@ def flatten(x): else: result.append(el) return result - diff --git a/HMM/util.pyc b/HMM/util.pyc deleted file mode 100644 index 2ee1f40..0000000 Binary files a/HMM/util.pyc and /dev/null differ diff --git a/HMM/viterbi.py b/HMM/viterbi.py index f302760..e9da4fe 100755 --- a/HMM/viterbi.py +++ b/HMM/viterbi.py @@ -6,11 +6,12 @@ from optparse import OptionParser import sys -from util import * +from util import normalize_filename, print_timing from dataset import DataSet from hmm import HMM -import sys +#import sys + @print_timing def run_viterbi(hmm, d, debug=False): @@ -18,47 +19,51 @@ def run_viterbi(hmm, d, debug=False): total_error = 0 total_n = 0 if debug: - print "\nRunning viterbi on each test sequence..." + print "\nRunning viterbi on each test sequence..." for i in range(len(d.test_output)): if debug: - print "Test sequence %d:" % i - errors = 0 - most_likely = [d.states[j] for j in hmm.most_likely_states(d.test_output[i])] - actual = [d.states[j] for j in d.test_state[i]] - n = len(most_likely) + print "Test sequence %d:" % i + errors = 0 + most_likely = [d.states[j] + for j in hmm.most_likely_states(d.test_output[i])] + actual = [d.states[j] for j in d.test_state[i]] + n = len(most_likely) # print "len(most_likely) = %d len(actual) = %d" % (n, len(actual)) - for j in range(n): - if debug: - print "%s %s %s" % ( - actual[j], most_likely[j], d.outputs[d.test_output[i][j]]) - if actual[j] != most_likely[j]: - errors += 1 - if debug: - print "errors: %d / %d = %.3f\n" % (errors, n, errors * 1.0 / n) - total_error += errors - total_n += n + for j in range(n): + if debug: + print "%s %s %s" % ( + actual[j], most_likely[j], d.outputs[d.test_output[i][j]]) + if actual[j] != most_likely[j]: + errors += 1 + if debug: + print "errors: %d / %d = %.3f\n" % ( + errors, n, errors * 1.0 / n) + total_error += errors + total_n += n - err = total_error * 1.0 / total_n + err = total_error * 1.0 / total_n if debug: - print "Total mistakes = %d / %d = %f" % (total_error, total_n, err) + print "Total mistakes = %d / %d = %f" % (total_error, total_n, err) return err + def train_hmm_from_data(data_filename, debug=False): if debug: - print "\n\nReading dataset %s ..." % data_filename + print "\n\nReading dataset %s ..." % data_filename data_filename = normalize_filename(data_filename) d = DataSet(data_filename) - #if options.verbose: + # if options.verbose: # print d if debug: - print "Building an HMM from the full training data..." + print "Building an HMM from the full training data..." 
hmm = HMM(d.states, d.outputs) hmm.learn_from_labeled_data(d.train_state, d.train_output) if debug: - print "The model:" - print hmm + print "The model:" + print hmm return (hmm, d) - + + def main(argv=None): if argv is None: argv = sys.argv @@ -72,10 +77,10 @@ def main(argv=None): (options, args) = parser.parse_args(argv[1:]) if len(args) != 1: parser.error("Must pass in a datafile") - - hmm, d = train_hmm_from_data(args[0], options.verbose) - err_full = run_viterbi(hmm, d , True) + hmm, d = train_hmm_from_data(args[0], options.verbose) + err_full = run_viterbi(hmm, d, True) + print(err_full) return 0 if __name__ == "__main__": diff --git a/HMM/viterbi.pyc b/HMM/viterbi.pyc deleted file mode 100644 index 6d18f9e..0000000 Binary files a/HMM/viterbi.pyc and /dev/null differ diff --git a/K-Means Clustering and PCA/mlclass-ex7/octave-core b/K-Means Clustering and PCA/mlclass-ex7/octave-core deleted file mode 100644 index dac0bd0..0000000 Binary files a/K-Means Clustering and PCA/mlclass-ex7/octave-core and /dev/null differ diff --git a/K-Means Clustering and PCA/ex7.pdf b/KMeansClusteringandPCA/ex7.pdf similarity index 100% rename from K-Means Clustering and PCA/ex7.pdf rename to KMeansClusteringandPCA/ex7.pdf diff --git a/KMeansClusteringandPCA/ex7.txt b/KMeansClusteringandPCA/ex7.txt new file mode 100644 index 0000000..cbb1f8b --- /dev/null +++ b/KMeansClusteringandPCA/ex7.txt @@ -0,0 +1,633 @@ +Programming Exercise 7: +K-means Clustering and Principal Component +Analysis +Machine Learning +November 29, 2011 + +Introduction +In this exercise, you will implement the K-means clustering algorithm and +apply it to compress an image. In the second part, you will use principal +component analysis to find a low-dimensional representation of face images. +Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated +topics. +To get started with the exercise, you will need to download the starter +code and unzip its contents to the directory where you wish to complete +the exercise. If needed, use the cd command in Octave to change to this +directory before starting this exercise. + +Files included in this exercise +ex7.m - Octave/Matlab script for the first exercise on K-means +ex7 pca.m - Octave/Matlab script for the second exercise on PCA +ex7data1.mat - Example Dataset for PCA +ex7data2.mat - Example Dataset for K-means +ex7faces.mat - Faces Dataset +bird small.png - Example Image +displayData.m - Displays 2D data stored in a matrix +drawLine.m - Draws a line over an exsiting figure +plotDataPoints.m - Initialization for K-means centroids +plotProgresskMeans.m - Plots each step of K-means as it proceeds +1 + + runkMeans.m - Runs the K-means algorithm +[ ] pca.m - Perform principal component analysis +[ ] projectData.m - Projects a data set into a lower dimensional space +[ ] recoverData.m - Recovers the original data from the projection +[ ] findClosestCentroids.m - Find closest centroids (used in K-means) +[ ] computeCentroids.m - Compute centroid means (used in K-means) +[ ] kMeansInitCentroids.m - Initialization for K-means centroids +indicates files you will need to complete +Throughout the first part of the exercise, you will be using the script +ex7.m, for the second part you will use ex7 pca.m. These scripts set up the +dataset for the problems and make calls to functions that you will write. +You are only required to modify functions in other files, by following the +instructions in this assignment. 
+
+Where to get help
+We also strongly encourage using the online Q&A Forum to discuss
+exercises with other students. However, do not look at any source code
+written by others or share your source code with others.
+If you run into network errors using the submit script, you can also use
+an online form for submitting your solutions. To use this alternative
+submission interface, run the submitWeb script to generate a submission
+file (e.g., submit_ex7_part2.txt). You can then submit this file through
+the web submission form on the programming exercises page (go to the
+programming exercises page, then select the exercise you are submitting
+for). If you have no problems submitting through the standard submission
+system using the submit script, you do not need to use this alternative
+submission interface.
+
+1 K-means Clustering
+
+In this exercise, you will implement the K-means algorithm and use it for
+image compression. You will first start on an example 2D dataset that
+will help you gain an intuition of how the K-means algorithm works. After
+that, you will use the K-means algorithm for image compression by reducing
+the number of colors that occur in an image to only those that are most
+common in that image. You will be using ex7.m for this part of the exercise.
+
+1.1 Implementing K-means
+
+The K-means algorithm is a method to automatically cluster similar data
+examples together. Concretely, you are given a training set {x(1), ..., x(m)}
+(where x(i) ∈ R^n), and want to group the data into a few cohesive
+"clusters". The intuition behind K-means is an iterative procedure that
+starts by guessing the initial centroids, and then refines this guess by
+repeatedly assigning examples to their closest centroids and then
+recomputing the centroids based on the assignments.
+The K-means algorithm is as follows:
+
+% Initialize centroids
+centroids = kMeansInitCentroids(X, K);
+for iter = 1:iterations
+    % Cluster assignment step: Assign each data point to the
+    % closest centroid. idx(i) corresponds to c^(i), the index
+    % of the centroid assigned to example i
+    idx = findClosestCentroids(X, centroids);
+    % Move centroid step: Compute means based on centroid
+    % assignments
+    centroids = computeCentroids(X, idx, K);
+end
+
+The inner loop of the algorithm repeatedly carries out two steps: (i)
+assigning each training example x(i) to its closest centroid, and (ii)
+recomputing the mean of each centroid using the points assigned to it.
+The K-means algorithm will always converge to some final set of means for
+the centroids. Note that the converged solution may not always be ideal;
+it depends on the initial setting of the centroids. Therefore, in practice
+the K-means algorithm is usually run a few times with different random
+initializations. One way to choose between these different solutions is
+to choose the one with the lowest cost function value (distortion).
+You will implement the two phases of the K-means algorithm separately
+in the next sections.
+
+1.1.1 Finding closest centroids
+
+In the "cluster assignment" phase of the K-means algorithm, the algorithm
+assigns every training example x(i) to its closest centroid, given the
+current positions of the centroids. Specifically, for every example i we set
+
+    c(i) := argmin_j ||x(i) - mu_j||^2,
+
+where c(i) is the index of the centroid that is closest to x(i), and mu_j
+is the position (value) of the j-th centroid. Note that c(i) corresponds
+to idx(i) in the starter code.
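+The graded version must be written in Octave, but for reference, here is a
+minimal NumPy sketch of this assignment step (the function name and the
+0-based indexing are our own illustrative choices, not part of the starter
+code, which uses a 1-based idx):
+
+import numpy as np
+
+def find_closest_centroids(X, centroids):
+    """Return the index of the nearest centroid for each example in X.
+
+    X: (m, n) data matrix; centroids: (K, n) current centroid positions.
+    """
+    # (m, K) matrix of squared distances from every example to every centroid
+    diffs = X[:, None, :] - centroids[None, :, :]
+    sq_dists = (diffs ** 2).sum(axis=2)
+    # argmin over the centroid axis implements c(i) := argmin_j ||x(i) - mu_j||^2
+    return sq_dists.argmin(axis=1)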
+Your task is to complete the code in findClosestCentroids.m. This
+function takes the data matrix X and the locations of all centroids inside
+centroids, and should output a one-dimensional array idx that holds the
+index (a value in {1, ..., K}, where K is the total number of centroids)
+of the closest centroid to every training example.
+You can implement this using a loop over every training example and
+every centroid.
+Once you have completed the code in findClosestCentroids.m, the
+script ex7.m will run your code and you should see the output [1 3 2],
+corresponding to the centroid assignments for the first 3 examples.
+You should now submit your "finding closest centroids" function.
+
+1.1.2 Computing centroid means
+
+Given assignments of every point to a centroid, the second phase of the
+algorithm recomputes, for each centroid, the mean of the points that were
+assigned to it. Specifically, for every centroid k we set
+
+    mu_k := (1 / |C_k|) * sum_{i in C_k} x(i)
+
+where C_k is the set of examples that are assigned to centroid k.
+Concretely, if two examples, say x(3) and x(5), are assigned to centroid
+k = 2, then you should update mu_2 = (1/2)(x(3) + x(5)).
+You should now complete the code in computeCentroids.m. You can
+implement this function using a loop over the centroids. You can also use
+a loop over the examples; but if you can use a vectorized implementation
+that does not use such a loop, your code may run faster.
+Once you have completed the code in computeCentroids.m, the script
+ex7.m will run your code and output the centroids after the first step of
+K-means.
+You should now submit your compute centroids function.
+
+1.2 K-means on example dataset
+
+[Figure 1: The expected output (plot titled "Iteration number 10"),
+showing the toy 2D dataset with the converged centroids.]
+
+After you have completed the two functions (findClosestCentroids and
+computeCentroids), the next step in ex7.m will run the K-means algorithm
+on a toy 2D dataset to help you understand how K-means works. Your
+functions are called from inside the runkMeans.m script. We encourage you
+to take a look at the function to understand how it works. Notice that
+the code calls the two functions you implemented in a loop.
+When you run the next step, the K-means code will produce a
+visualization that steps you through the progress of the algorithm at
+each iteration. Press enter multiple times to see how each step of the
+K-means algorithm changes the centroids and cluster assignments. At the
+end, your figure should look like the one displayed in Figure 1.
+
+1.3 Random initialization
+
+The initial assignments of centroids for the example dataset in ex7.m
+were designed so that you will see the same figure as in Figure 1. In
+practice, a good strategy for initializing the centroids is to select
+random examples from the training set.
+In this part of the exercise, you should complete the function
+kMeansInitCentroids.m with the following code:
+
+% Initialize the centroids to be random examples
+% Randomly reorder the indices of examples
+randidx = randperm(size(X, 1));
+% Take the first K examples as centroids
+centroids = X(randidx(1:K), :);
+
+The code above first randomly permutes the indices of the examples
+(using randperm). Then, it selects the first K examples based on the
+random permutation of the indices. This allows the examples to be
+selected at random without the risk of selecting the same example twice.
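+For reference again, here is a NumPy sketch of the centroid update and the
+random initialization just described (illustrative names only; the graded
+versions belong in computeCentroids.m and kMeansInitCentroids.m):
+
+import numpy as np
+
+def compute_centroids(X, idx, K):
+    """Recompute each centroid as the mean of the points assigned to it.
+
+    X: (m, n) data; idx: (m,) 0-based assignments. Assumes every centroid
+    has at least one assigned point.
+    """
+    centroids = np.zeros((K, X.shape[1]))
+    for k in range(K):
+        centroids[k] = X[idx == k].mean(axis=0)  # mu_k = mean of cluster k
+    return centroids
+
+def kmeans_init_centroids(X, K):
+    """Pick K distinct training examples as the initial centroids."""
+    randidx = np.random.permutation(X.shape[0])
+    return X[randidx[:K]]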
+You do not need to make any submissions for this part of the exercise.
+
+1.4 Image compression with K-means
+
+[Figure 2: The original 128x128 image.]
+
+In this exercise, you will apply K-means to image compression. In a
+straightforward 24-bit color representation of an image,(1) each pixel is
+represented as three 8-bit unsigned integers (ranging from 0 to 255) that
+specify the red, green and blue intensity values. This encoding is often
+referred to as the RGB encoding. Our image contains thousands of colors,
+and in this part of the exercise, you will reduce the number of colors
+to 16.
+By making this reduction, it is possible to represent (compress) the
+photo in an efficient way. Specifically, you only need to store the RGB
+values of the 16 selected colors, and for each pixel in the image you now
+need to store only the index of the color at that location (where only 4
+bits are necessary to represent 16 possibilities).
+In this exercise, you will use the K-means algorithm to select the 16
+colors that will be used to represent the compressed image. Concretely,
+you will treat every pixel in the original image as a data example and
+use the K-means algorithm to find the 16 colors that best group (cluster)
+the pixels in the 3-dimensional RGB space. Once you have computed the
+cluster centroids on the image, you will then use the 16 colors to
+replace the pixels in the original image.
+
+1.4.1 K-means on pixels
+
+In Matlab and Octave, images can be read in as follows:
+
+% Load 128x128 color image (bird_small.png)
+A = imread('bird_small.png');
+% You will need to have installed the image package to use
+% imread. If you do not have the image package installed, you
+% should instead change the following line to
+%
+%   load('bird_small.mat'); % Loads the image into the variable A
+
+This creates a three-dimensional matrix A whose first two indices
+identify a pixel position and whose last index represents red, green, or
+blue. For example, A(50, 33, 3) gives the blue intensity of the pixel at
+row 50 and column 33.
+The code inside ex7.m first loads the image, and then reshapes it to
+create an m × 3 matrix of pixel colors (where m = 16384 = 128 × 128), and
+calls your K-means function on it.
+After finding the top K = 16 colors to represent the image, you can
+now assign each pixel position to its closest centroid using the
+findClosestCentroids function. This allows you to represent the original
+image using the centroid assignments of each pixel. Notice that you have
+significantly reduced the number of bits that are required to describe
+the image. The original image required 24 bits for each of the 128 × 128
+pixel locations, resulting in a total size of 128 × 128 × 24 = 393,216
+bits. The new representation requires some overhead storage in the form
+of a dictionary of 16 colors, each of which requires 24 bits, but the
+image itself then only requires 4 bits per pixel location. The final
+number of bits used is therefore 16 × 24 + 128 × 128 × 4 = 65,920 bits,
+which corresponds to compressing the original image by about a factor
+of 6.
+
+(1) The provided photo used in this exercise belongs to Frank Wouters and
+is used with his permission.
+
+[Figure 3: Original and reconstructed image (when using K-means to
+compress the image).]
+
+Finally, you can view the effects of the compression by reconstructing
+the image based only on the centroid assignments. Specifically, you can
+replace each pixel location with the mean of the centroid assigned to it.
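+As a sanity check on the arithmetic above, and to show how the pieces fit
+together, here is a rough Python sketch of the compression loop (it reuses
+the illustrative NumPy helpers sketched earlier in this document; none of
+this replaces the Octave exercise code):
+
+def compress_with_kmeans(A, K=16, iterations=10):
+    """Quantize an (H, W, 3) image array to K colors with K-means."""
+    H, W, _ = A.shape
+    X = A.reshape(-1, 3).astype(float)       # m x 3 matrix of pixels
+    centroids = kmeans_init_centroids(X, K)
+    for _ in range(iterations):
+        idx = find_closest_centroids(X, centroids)
+        centroids = compute_centroids(X, idx, K)
+    # Reconstruction: replace every pixel by its assigned centroid color
+    return centroids[idx].reshape(H, W, 3)
+
+original_bits = 128 * 128 * 24              # 393,216
+compressed_bits = 16 * 24 + 128 * 128 * 4   # 65,920, about 6x smaller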
+Figure 3 shows the reconstruction we obtained. Even though the resulting
+image retains most of the characteristics of the original, we also see
+some compression artifacts.
+You do not need to make any submissions for this part of the exercise.
+
+1.5 Optional (ungraded) exercise: Use your own image
+
+In this exercise, modify the code we have supplied to run on one of your
+own images. Note that if your image is very large, then K-means can take
+a long time to run. Therefore, we recommend that you resize your images
+to manageable sizes before running the code. You can also try to vary K
+to see the effects on the compression.
+
+2 Principal Component Analysis
+
+In this exercise, you will use principal component analysis (PCA) to
+perform dimensionality reduction. You will first experiment with an
+example 2D dataset to get intuition on how PCA works, and then use it on
+a bigger dataset of 5000 face images.
+The provided script, ex7_pca.m, will help you step through the first
+half of the exercise.
+
+2.1 Example Dataset
+
+To help you understand how PCA works, you will first start with a 2D
+dataset which has one direction of large variation and one of smaller
+variation. The script ex7_pca.m will plot the training data (Figure 4).
+In this part of the exercise, you will visualize what happens when you
+use PCA to reduce the data from 2D to 1D. In practice, you might want to
+reduce data from 256 to 50 dimensions, say; but using lower dimensional
+data in this example allows us to visualize the algorithms better.
+
+[Figure 4: Example Dataset 1.]
+
+2.2 Implementing PCA
+
+In this part of the exercise, you will implement PCA. PCA consists of
+two computational steps: First, you compute the covariance matrix of the
+data. Then, you use Octave's SVD function to compute the eigenvectors
+U1, U2, ..., Un. These will correspond to the principal components of
+variation in the data.
+Before using PCA, it is important to first normalize the data by
+subtracting the mean value of each feature from the dataset, and scaling
+each dimension so that they are in the same range. In the provided script
+ex7_pca.m, this normalization has been performed for you using the
+featureNormalize function.
+After normalizing the data, you can run PCA to compute the principal
+components. Your task is to complete the code in pca.m to compute the
+principal components of the dataset. First, you should compute the
+covariance matrix of the data, which is given by:
+
+    Sigma = (1/m) X^T X
+
+where X is the data matrix with examples in rows, and m is the number of
+examples. Note that Sigma is an n × n matrix and not the summation
+operator.
+After computing the covariance matrix, you can run SVD on it to
+compute the principal components. In Octave, you can run SVD with the
+following command: [U, S, V] = svd(Sigma), where U will contain the
+principal components and S will contain a diagonal matrix.
+
+[Figure 5: Computed eigenvectors of the dataset.]
+
+Once you have completed pca.m, the ex7_pca.m script will run PCA on
+the example dataset and plot the corresponding principal components found
+(Figure 5). The script will also output the top principal component
+(eigenvector) found, and you should expect to see an output of about
+[-0.707 -0.707] (or possibly the negative of this, since U1 and -U1 are
+equally valid choices for the first principal component).
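+For reference, here is a compact NumPy sketch of this computation,
+together with the projection and recovery steps described in the next
+section (illustrative function names; X is assumed to be normalized
+already):
+
+import numpy as np
+
+def pca(X):
+    """Return the principal components U and singular values S of X."""
+    m = X.shape[0]
+    Sigma = X.T.dot(X) / m            # n x n covariance matrix
+    U, S, V = np.linalg.svd(Sigma)    # columns of U are the components
+    return U, S
+
+def project_data(X, U, K):
+    """Project each example onto the top K principal components."""
+    return X.dot(U[:, :K])
+
+def recover_data(Z, U, K):
+    """Map projected data back into the original n-dimensional space."""
+    return Z.dot(U[:, :K].T)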
+You should now submit your PCA function.
+
+2.3 Dimensionality Reduction with PCA
+
+After computing the principal components, you can use them to reduce the
+feature dimension of your dataset by projecting each example onto a lower
+dimensional space, x(i) → z(i) (e.g., projecting the data from 2D to 1D).
+In this part of the exercise, you will use the eigenvectors returned by
+PCA and project the example dataset into a 1-dimensional space.
+In practice, if you were using a learning algorithm such as linear
+regression or perhaps neural networks, you could now use the projected
+data instead of the original data. By using the projected data, you can
+train your model faster as there are fewer dimensions in the input.
+
+2.3.1 Projecting the data onto the principal components
+
+You should now complete the code in projectData.m. Specifically, you are
+given a dataset X, the principal components U, and the desired number of
+dimensions to reduce to, K. You should project each example in X onto the
+top K components in U. Note that the top K components in U are given by
+the first K columns of U, that is U_reduce = U(:, 1:K).
+Once you have completed the code in projectData.m, ex7_pca.m will
+project the first example onto the first dimension and you should see a
+value of about 1.481 (or possibly -1.481, if you got -U1 instead of U1).
+You should now submit the project data function.
+
+2.3.2 Reconstructing an approximation of the data
+
+After projecting the data onto the lower dimensional space, you can
+approximately recover the data by projecting them back onto the original
+high dimensional space. Your task is to complete recoverData.m to project
+each example in Z back onto the original space and return the recovered
+approximation in X_rec.
+Once you have completed the code in recoverData.m, ex7_pca.m will
+recover an approximation of the first example and you should see a value
+of about [-1.047 -1.047].
+You should now submit the recover data function.
+
+2.3.3 Visualizing the projections
+
+[Figure 6: The normalized and projected data after PCA.]
+
+After completing both projectData and recoverData, ex7_pca.m will now
+perform both the projection and approximate reconstruction to show how
+the projection affects the data. In Figure 6, the original data points
+are indicated with the blue circles, while the projected data points are
+indicated with the red circles. The projection effectively only retains
+the information in the direction given by U1.
+
+2.4 Face Image Dataset
+
+In this part of the exercise, you will run PCA on face images to see how
+it can be used in practice for dimension reduction. The dataset
+ex7faces.mat contains a dataset(2) X of face images, each 32 × 32 in
+grayscale. Each row of X corresponds to one face image (a row vector of
+length 1024). The next step in ex7_pca.m will load and visualize the
+first 100 of these face images (Figure 7).
+
+(2) This dataset was based on a cropped version of the labeled faces in
+the wild dataset.
+
+[Figure 7: Faces dataset.]
+
+2.4.1 PCA on Faces
+
+To run PCA on the face dataset, we first normalize the dataset by
+subtracting the mean of each feature from the data matrix X. The script
+ex7_pca.m will do this for you and then run your PCA code.
+After running PCA, you will obtain the principal components of the
+dataset. Notice that each principal component in U (each column) is a
+vector of length n (where for the face dataset, n = 1024). It turns out
+that we can visualize these principal components by reshaping each of
+them into a 32 × 32 matrix that corresponds to the pixels in the original
+dataset. The script ex7_pca.m displays the first 36 principal components
+that describe the largest variations (Figure 8). If you want, you can
+also change the code to display more principal components to see how
+they capture more and more details.
+
+[Figure 8: Principal components on the face dataset.]
+
+2.4.2 Dimensionality Reduction
+
+Now that you have computed the principal components for the face dataset,
+you can use them to reduce the dimension of the face dataset. This allows
+you to use your learning algorithm with a smaller input size (e.g., 100
+dimensions) instead of the original 1024 dimensions. This can help speed
+up your learning algorithm.
+
+[Figure 9: Original images of faces and ones reconstructed from only the
+top 100 principal components.]
+
+The next part in ex7_pca.m will project the face dataset onto only the
+first 100 principal components. Concretely, each face image is now
+described by a vector z(i) ∈ R^100.
+To understand what is lost in the dimension reduction, you can recover
+the data using only the projected dataset. In ex7_pca.m, an approximate
+recovery of the data is performed and the original and projected face
+images are displayed side by side (Figure 9). From the reconstruction,
+you can observe that the general structure and appearance of the face are
+kept while the fine details are lost. This is a remarkable reduction
+(more than 10×) in the dataset size that can help speed up your learning
+algorithm significantly. For example, if you were training a neural
+network to perform person recognition (given a face image, predict the
+identity of the person), you can use the dimension-reduced input of only
+100 dimensions instead of the original pixels.
+
+2.5 Optional (ungraded) exercise: PCA for visualization
+
+[Figure 10: Original data in 3D.]
+
+In the earlier K-means image compression exercise, you used the K-means
+algorithm in the 3-dimensional RGB space. In the last part of the
+ex7_pca.m script, we have provided code to visualize the final pixel
+assignments in this 3D space using the scatter3 function. Each data point
+is colored according to the cluster it has been assigned to. You can drag
+your mouse on the figure to rotate and inspect this data in 3 dimensions.
+It turns out that visualizing datasets in 3 dimensions or greater can
+be cumbersome. Therefore, it is often desirable to only display the data
+in 2D even at the cost of losing some information. In practice, PCA is
+often used to reduce the dimensionality of data for visualization
+purposes. In the next part of ex7_pca.m, the script will apply your
+implementation of PCA to the 3-dimensional data to reduce it to 2
+dimensions and visualize the result in a 2D scatter plot. The PCA
+projection can be thought of as a rotation that selects the view that
+maximizes the spread of the data, which often corresponds to the "best"
+view.
+
+[Figure 11: 2D visualization produced using PCA.]
+
+Submission and Grading
+
+After completing the various parts of the assignment, be sure to use the
+submit function to submit your solutions to our servers. The following is
+a breakdown of how each part of this exercise is scored.
+Submitted File            Part                      Points
+findClosestCentroids.m    Find Closest Centroids    30 points
+computeCentroids.m        Compute Centroid Means    30 points
+pca.m                     PCA                       20 points
+projectData.m             Project Data              10 points
+recoverData.m             Recover Data              10 points
+                          Total Points              100 points
+
+You are allowed to submit your solutions multiple times, and we will take
+only the highest score into consideration. To prevent rapid-fire
+guessing, the system enforces a minimum of 5 minutes between submissions.
+All parts of this programming exercise are due Sunday, December 4th
+at 23:59:59 PDT.
\ No newline at end of file
diff --git a/K-Means Clustering and PCA/mlclass-ex7/bird_small.mat b/KMeansClusteringandPCA/mlclass-ex7/bird_small.mat
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/bird_small.mat
rename to KMeansClusteringandPCA/mlclass-ex7/bird_small.mat
diff --git a/K-Means Clustering and PCA/mlclass-ex7/bird_small.png b/KMeansClusteringandPCA/mlclass-ex7/bird_small.png
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/bird_small.png
rename to KMeansClusteringandPCA/mlclass-ex7/bird_small.png
diff --git a/K-Means Clustering and PCA/mlclass-ex7/computeCentroids.m b/KMeansClusteringandPCA/mlclass-ex7/computeCentroids.m
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/computeCentroids.m
rename to KMeansClusteringandPCA/mlclass-ex7/computeCentroids.m
diff --git a/K-Means Clustering and PCA/mlclass-ex7/displayData.m b/KMeansClusteringandPCA/mlclass-ex7/displayData.m
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/displayData.m
rename to KMeansClusteringandPCA/mlclass-ex7/displayData.m
diff --git a/K-Means Clustering and PCA/mlclass-ex7/drawLine.m b/KMeansClusteringandPCA/mlclass-ex7/drawLine.m
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/drawLine.m
rename to KMeansClusteringandPCA/mlclass-ex7/drawLine.m
diff --git a/K-Means Clustering and PCA/mlclass-ex7/ex7.m b/KMeansClusteringandPCA/mlclass-ex7/ex7.m
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/ex7.m
rename to KMeansClusteringandPCA/mlclass-ex7/ex7.m
diff --git a/K-Means Clustering and PCA/mlclass-ex7/ex7_pca.m b/KMeansClusteringandPCA/mlclass-ex7/ex7_pca.m
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/ex7_pca.m
rename to KMeansClusteringandPCA/mlclass-ex7/ex7_pca.m
diff --git a/K-Means Clustering and PCA/mlclass-ex7/ex7data1.mat b/KMeansClusteringandPCA/mlclass-ex7/ex7data1.mat
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/ex7data1.mat
rename to KMeansClusteringandPCA/mlclass-ex7/ex7data1.mat
diff --git a/K-Means Clustering and PCA/mlclass-ex7/ex7data2.mat b/KMeansClusteringandPCA/mlclass-ex7/ex7data2.mat
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/ex7data2.mat
rename to KMeansClusteringandPCA/mlclass-ex7/ex7data2.mat
diff --git a/K-Means Clustering and PCA/mlclass-ex7/ex7faces.mat b/KMeansClusteringandPCA/mlclass-ex7/ex7faces.mat
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/ex7faces.mat
rename to KMeansClusteringandPCA/mlclass-ex7/ex7faces.mat
diff --git a/K-Means Clustering and PCA/mlclass-ex7/featureNormalize.m b/KMeansClusteringandPCA/mlclass-ex7/featureNormalize.m
similarity index 100%
rename from K-Means Clustering and PCA/mlclass-ex7/featureNormalize.m
rename to KMeansClusteringandPCA/mlclass-ex7/featureNormalize.m
diff --git a/K-Means Clustering and
PCA/mlclass-ex7/findClosestCentroids.m b/KMeansClusteringandPCA/mlclass-ex7/findClosestCentroids.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/findClosestCentroids.m rename to KMeansClusteringandPCA/mlclass-ex7/findClosestCentroids.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/kMeansInitCentroids.m b/KMeansClusteringandPCA/mlclass-ex7/kMeansInitCentroids.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/kMeansInitCentroids.m rename to KMeansClusteringandPCA/mlclass-ex7/kMeansInitCentroids.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/pca.m b/KMeansClusteringandPCA/mlclass-ex7/pca.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/pca.m rename to KMeansClusteringandPCA/mlclass-ex7/pca.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/plotDataPoints.m b/KMeansClusteringandPCA/mlclass-ex7/plotDataPoints.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/plotDataPoints.m rename to KMeansClusteringandPCA/mlclass-ex7/plotDataPoints.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/plotProgresskMeans.m b/KMeansClusteringandPCA/mlclass-ex7/plotProgresskMeans.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/plotProgresskMeans.m rename to KMeansClusteringandPCA/mlclass-ex7/plotProgresskMeans.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/projectData.m b/KMeansClusteringandPCA/mlclass-ex7/projectData.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/projectData.m rename to KMeansClusteringandPCA/mlclass-ex7/projectData.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/recoverData.m b/KMeansClusteringandPCA/mlclass-ex7/recoverData.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/recoverData.m rename to KMeansClusteringandPCA/mlclass-ex7/recoverData.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/runkMeans.m b/KMeansClusteringandPCA/mlclass-ex7/runkMeans.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/runkMeans.m rename to KMeansClusteringandPCA/mlclass-ex7/runkMeans.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/submit.m b/KMeansClusteringandPCA/mlclass-ex7/submit.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/submit.m rename to KMeansClusteringandPCA/mlclass-ex7/submit.m diff --git a/K-Means Clustering and PCA/mlclass-ex7/submitWeb.m b/KMeansClusteringandPCA/mlclass-ex7/submitWeb.m similarity index 100% rename from K-Means Clustering and PCA/mlclass-ex7/submitWeb.m rename to KMeansClusteringandPCA/mlclass-ex7/submitWeb.m diff --git a/Lectures/aimlcs229/AI-classes.txt b/Lectures/aimlcs229/AI-classes.txt new file mode 100644 index 0000000..fccffcd --- /dev/null +++ b/Lectures/aimlcs229/AI-classes.txt @@ -0,0 +1,65 @@ +List of related AI Classes +CS229 covered a broad swath of topics in machine learning, compressed into a single quarter. Machine learning is a hugely inter-disciplinary topic, and there are many +other sub-communities of AI working on related topics, or working on applying machine +learning to different problems. +Stanford has one of the best and broadest sets of AI courses of pretty much any +university. It offers a wide range of classes, covering most of the scope of AI issues. Here +are some some classes in which you can learn more about topics related to CS229: +AI Overview +• CS221 (Aut): Artificial Intelligence: Principles and Techniques. 
Broad overview +of AI and applications, including robotics, vision, NLP, search, Bayesian networks, +and learning. Taught by Professor Andrew Ng. +Robotics +• CS223A (Win): Robotics from the perspective of building the robot and controlling +it; focus on manipulation. Taught by Professor Oussama Khatib (who builds the +big robots in the Robotics Lab). +• CS225A (Spr): A lab course from the same perspective, taught by Professor Khatib. +• CS225B (Aut): A lab course where you get to play around with making mobile +robots navigate in the real world. Taught by Dr. Kurt Konolige (SRI). +• CS277 (Spr): Experimental Haptics. Teaches haptics programming and touch +feedback in virtual reality. Taught by Professor Ken Salisbury, who works on +robot design, haptic devices/teleoperation, robotic surgery, and more. +• CS326A (Latombe): Motion planning. An algorithmic robot motion planning +course, by Professor Jean-Claude Latombe, who (literally) wrote the book on the +topic. +Knowledge Representation & Reasoning +• CS222 (Win): Logical knowledge representation and reasoning. Taught by Professor Yoav Shoham and Professor Johan van Benthem. +• CS227 (Spr): Algorithmic methods such as search, CSP, planning. Taught by Dr. +Yorke-Smith (SRI). +Probabilistic Methods +• CS228 (Win): Probabilistic models in AI. Bayesian networks, hidden Markov models, and planning under uncertainty. Taught by Professor Daphne Koller, who +works on computational biology, Bayes nets, learning, computational game theory, +and more. +1 + + Perception & Understanding +• CS223B (Win): Introduction to computer vision. Algorithms for processing and +interpreting image or camera information. Taught by Professor Sebastian Thrun, +who led the DARPA Grand Challenge/DARPA Urban Challenge teams, or Professor Jana Kosecka, who works on vision and robotics. +• CS224S (Win): Speech recognition and synthesis. Algorithms for large vocabulary continuous speech recognition, text-to-speech, conversational dialogue agents. +Taught by Professor Dan Jurafsky, who co-authored one of the two most-used +textbooks on NLP. +• CS224N (Spr): Natural language processing, including parsing, part of speech +tagging, information extraction from text, and more. Taught by Professor Chris +Manning, who co-authored the other of the two most-used textbooks on NLP. +• CS224U (Win): Natural language understanding, including computational semantics and pragmatics, with application to question answering, summarization, and +inference. Taught by Professors Dan Jurafsky and Chris Manning. +Multi-agent systems +• CS224M (Win): Multi-agent systems, including game theoretic foundations, designing systems that induce agents to coordinate, and multi-agent learning. Taught +by Professor Yoav Shoham, who works on economic models of multi-agent interactions. +• CS227B (Spr): General game playing. Reasoning and learning methods for playing +any of a broad class of games. Taught by Professor Michael Genesereth, who works +on computational logic, enterprise management and e-commerce. +Convex Optimization +• EE364A (Win): Convex Optimization. Convexity, duality, convex programs, interior point methods, algorithms. Taught by Professor Stephen Boyd, who works on +optimization and its application to engineering problems. +AI Project courses +• CS294B/CS294W (Win): STAIR (STanford AI Robot) project. Project course +with no lectures. 
By drawing from machine learning and all other areas of AI, +we’ll work on the challenge problem of building a general-purpose robot that can +carry out home and office chores, such as tidying up a room, fetching items, and +preparing meals. Taught by Professor Andrew Ng. + +2 + + \ No newline at end of file diff --git a/Lectures/aimlcs229/ML-advice.txt b/Lectures/aimlcs229/ML-advice.txt new file mode 100644 index 0000000..f15a2ef --- /dev/null +++ b/Lectures/aimlcs229/ML-advice.txt @@ -0,0 +1,623 @@ +Advice for applying +Machine Learning +Andrew Ng +Stanford University + +Andrew Y. Ng + + Today’s Lecture + +• + +Advice on how getting learning algorithms to different applications. + +• + +Most of today’s material is not very mathematical. But it’s also some of the +hardest material in this class to understand. + +• + +Some of what I’ll say today is debatable. + +• + +Some of what I’ll say is not good advice for doing novel machine learning +research. + +• + +Key ideas: + +1. Diagnostics for debugging learning algorithms. +2. Error analyses and ablative analysis. +3. How to get started on a machine learning problem. +– Premature (statistical) optimization. + +Andrew Y. Ng + + Debugging Learning +Algorithms + +Andrew Y. Ng + + Debugging learning algorithms +Motivating example: +• Anti-spam. You carefully choose a small set of 100 words to use as +features. (Instead of using all 50000+ words in English.) +• Bayesian logistic regression, implemented with gradient descent, gets 20% +test error, which is unacceptably high. + +• What to do next? + +Andrew Y. Ng + + Fixing the learning algorithm + +• Bayesian logistic regression: + +• Common approach: Try improving the algorithm in different ways. +– +– +– +– +– +– +– +– + +Try getting more training examples. +Try a smaller set of features. +Try a larger set of features. +Try changing the features: Email header vs. email body features. +Run gradient descent for more iterations. +Try Newton’s method. +Use a different value for λ. +Try using an SVM. + +• This approach might work, but it’s very time-consuming, and largely a matter +of luck whether you end up fixing what the problem really is. + +Andrew Y. Ng + + Diagnostic for bias vs. variance + +Better approach: +– Run diagnostics to figure out what the problem is. +– Fix whatever the problem is. + +Bayesian logistic regression’s test error is 20% (unacceptably high). +Suppose you suspect the problem is either: +– Overfitting (high variance). +– Too few features to classify spam (high bias). + +Diagnostic: +– Variance: Training error will be much lower than test error. +– Bias: Training error will also be high. + +Andrew Y. Ng + + More on bias vs. variance + +Typical learning curve for high variance: + +error + +Test error +Desired performance +Training error + +m (training set size) + +• Test error still decreasing as m increases. Suggests larger training set +will help. +• Large gap between training and test error. +Andrew Y. Ng + + More on bias vs. variance + +Typical learning curve for high bias: + +error + +Test error +Training error +Desired performance + +m (training set size) + +• Even training error is unacceptably high. +• Small gap between training and test error. +Andrew Y. Ng + + Diagnostics tell you what to try next + +Bayesian logistic regression, implemented with gradient descent. +Fixes to try: +– +– +– +– +– +– +– +– + +Try getting more training examples. +Try a smaller set of features. +Try a larger set of features. +Try email header features. 
+Run gradient descent for more iterations. +Try Newton’s method. +Use a different value for λ. +Try using an SVM. + +Fixes high variance. +Fixes high variance. +Fixes high bias. +Fixes high bias. + +Andrew Y. Ng + + Optimization algorithm diagnostics + +• Bias vs. variance is one common diagnostic. +• For other problems, it’s usually up to your own ingenuity to construct your +own diagnostics to figure out what’s wrong. +• Another example: +– Bayesian logistic regression gets 2% error on spam, and 2% error on non-spam. +(Unacceptably high error on non-spam.) +– SVM using a linear kernel gets 10% error on spam, and 0.01% error on nonspam. (Acceptable performance.) +– But you want to use logistic regression, because of computational efficiency, etc. + +• What to do next? + +Andrew Y. Ng + + More diagnostics + +• Other common questions: +– Is the algorithm (gradient descent for logistic regression) converging? + +Objective + +J(θ) + +Iterations +It’s often very hard to tell if an algorithm has converged yet by looking at the objective. +Andrew Y. Ng + + More diagnostics + +• Other common questions: +– Is the algorithm (gradient descent for logistic regression) converging? +– Are you optimizing the right function? +– I.e., what you care about: + +(weights w(i) higher for non-spam than for spam). +– Bayesian logistic regression? Correct value for λ? + +– SVM? Correct value for C? + +Andrew Y. Ng + + Diagnostic +An SVM outperforms Bayesian logistic regression, but you really want to deploy Bayesian +logistic regression for your application. +Let θSVM be the parameters learned by an SVM. +Let θBLR be the parameters learned by Bayesian logistic regression. +You care about weighted accuracy: + +θSVM outperforms θBLR. So: + +BLR tries to maximize: + +Diagnostic: + +Andrew Y. Ng + + Two cases + +Case 1: + +But BLR was trying to maximize J(θ). This means that θBLR fails to maximize J, and the +problem is with the convergence of the algorithm. Problem is with optimization +algorithm. + +Case 2: + +This means that BLR succeeded at maximizing J(θ). But the SVM, which does worse on +J(θ), actually does better on weighted accuracy a(θ). +This means that J(θ) is the wrong function to be maximizing, if you care about a(θ). +Problem is with objective function of the maximization problem. + +Andrew Y. Ng + + Diagnostics tell you what to try next + +Bayesian logistic regression, implemented with gradient descent. +Fixes to try: +– +– +– +– +– +– +– +– + +Try getting more training examples. +Try a smaller set of features. +Try a larger set of features. +Try email header features. +Run gradient descent for more iterations. +Try Newton’s method. +Use a different value for λ. +Try using an SVM. + +Fixes high variance. +Fixes high variance. +Fixes high bias. +Fixes high bias. +Fixes optimization algorithm. +Fixes optimization algorithm. +Fixes optimization objective. +Fixes optimization objective. + +Andrew Y. Ng + + The Stanford Autonomous Helicopter + +Payload: 14 pounds +Weight: 32 pounds + +Andrew Y. Ng + + Machine learning algorithm + +Simulator + +1. Build a simulator of helicopter. +2. Choose a cost function. Say J(θ) = ||x – xdesired||2 + +(x = helicopter position) + +3. Run reinforcement learning (RL) algorithm to fly helicopter in simulation, so +as to try to minimize cost function: +θRL = arg minθ J(θ) + +Suppose you do this, and the resulting controller parameters θRL gives much worse +performance than your human pilot. What to do next? +Improve simulator? +Modify cost function J? +Modify RL algorithm? 
+ +Andrew Y. Ng + + Debugging an RL algorithm + +The controller given by θRL performs poorly. +Suppose that: +1. The helicopter simulator is accurate. +2. The RL algorithm correctly controls the helicopter (in simulation) so as to +minimize J(θ). +3. Minimizing J(θ) corresponds to correct autonomous flight. + +Then: The learned parameters θRL should fly well on the actual helicopter. +Diagnostics: +1. If θRL flies well in simulation, but not in real life, then the problem is in the +simulator. Otherwise: +2. Let θhuman be the human control policy. If J(θhuman) < J(θRL), then the problem is +in the reinforcement learning algorithm. (Failing to minimize the cost function J.) +3. If J(θhuman) ≥ J(θRL), then the problem is in the cost function. (Maximizing it +doesn’t correspond to good autonomous flight.) + +Andrew Y. Ng + + More on diagnostics + +• Quite often, you’ll need to come up with your own diagnostics to figure out +what’s happening in an algorithm. +• Even if a learning algorithm is working well, you might also run diagnostics to +make sure you understand what’s going on. This is useful for: +– Understanding your application problem: If you’re working on one important ML +application for months/years, it’s very valuable for you personally to get a intuitive +understand of what works and what doesn’t work in your problem. +– Writing research papers: Diagnostics and error analysis help convey insight about +the problem, and justify your research claims. +– I.e., Rather than saying “Here’s an algorithm that works,” it’s more interesting to +say “Here’s an algorithm that works because of component X, and here’s my +justification.” + +• Good machine learning practice: Error analysis. Try to understand what +your sources of error are. + +Andrew Y. Ng + + Error Analysis + +Andrew Y. Ng + + Error analysis + +Many applications combine many different learning components into a +“pipeline.” E.g., Face recognition from images: [contrived example] + +Camera +image + +Preprocess +(remove background) + +Eyes segmentation +Face detection + +Nose segmentation + +Logistic regression + +Label + +Mouth segmentation +Andrew Y. Ng + + Error analysis +Camera +image + +Preprocess +(remove background) + +Eyes segmentation +Face detection + +Nose segmentation + +Logistic regression + +Label + +Mouth segmentation + +How much error is attributable to each of the +components? +Plug in ground-truth for each component, and +see how accuracy changes. +Conclusion: Most room for improvement in face +detection and eyes segmentation. + +Component + +Accuracy + +Overall system + +85% + +Preprocess (remove +background) + +85.1% + +Face detection + +91% + +Eyes segmentation + +95% + +Nose segmentation + +96% + +Mouth segmentation + +97% + +Logistic regression + +100% +Andrew +Y. Ng + + Ablative analysis + +Error analysis tries to explain the difference between current performance and +perfect performance. +Ablative analysis tries to explain the difference between some baseline (much +poorer) performance and current performance. +E.g., Suppose that you’ve build a good anti-spam classifier by adding lots of +clever features to logistic regression: +– +– +– +– +– +– + +Spelling correction. +Sender host features. +Email header features. +Email text parser features. +Javascript parser. +Features from embedded images. + +Question: How much did each of these components really help? + +Andrew Y. Ng + + Ablative analysis + +Simple logistic regression without any clever features get 94% performance. 
+Just what accounts for your improvement from 94 to 99.9%? +Ablative analysis: Remove components from your system one at a time, to see +how it breaks. +Component + +Accuracy + +Overall system + +99.9% + +Spelling correction + +99.0 + +Sender host features + +98.9% + +Email header features + +98.9% + +Email text parser features + +95% + +Javascript parser + +94.5% + +Features from images + +94.0% + +[baseline] + +Conclusion: The email text parser features account for most of the +improvement. +Andrew Y. Ng + + Getting started on a +learning problem + +Andrew Y. Ng + + Getting started on a problem +Approach #1: Careful design. +• + +Spend a long term designing exactly the right features, collecting the right dataset, +and designing the right algorithmic architecture. + +• + +Implement it and hope it works. + +• + +Benefit: Nicer, perhaps more scalable algorithms. May come up with new, elegant, +learning algorithms; contribute to basic research in machine learning. + +Approach #2: Build-and-fix. +• + +Implement something quick-and-dirty. + +• + +Run error analyses and diagnostics to see what’s wrong with it, and fix its errors. + +• + +Benefit: Will often get your application problem working more quickly. Faster time to +market. + +Andrew Y. Ng + + Premature statistical optimization +Very often, it’s not clear what parts of a system are easy or difficult to build, and +which parts you need to spend lots of time focusing on. E.g., + +Camera +image + +Preprocess +(remove background) + +This system’s much too +complicated for a first attempt. + +Eyes segmentation +Face detection + +Nose segmentation + +Step 1 of designing a learning +system: Plot the data. +Label + +Logistic regression + +Mouth segmentation + +The only way to find out what needs work is to implement something quickly, +and find out what parts break. +[But this may be bad advice if your goal is to come up with new machine +learning algorithms.] +Andrew Y. Ng + + The danger of over-theorizing + +3d similarity +learning + +Color +invariance + +Obstacle +avoidance + +VC +dimension + +Object +detection + +Navigation + +Mail +delivery +robot + +Differential +geometry of +3d manifolds + +Robot +manipulation + +[Based on Papadimitriou, 1995] + +Complexity of +non-Riemannian +geometries + +… + +Convergence +bounds for +sampled nonmonotonic logic + +Andrew Y. Ng + + Summary + +Andrew Y. Ng + + Summary + +• Time spent coming up with diagnostics for learning algorithms is time wellspent. +• It’s often up to your own ingenuity to come up with right diagnostics. +• Error analyses and ablative analyses also give insight into the problem. +• Two approaches to applying learning algorithms: +– Design very carefully, then implement. +• Risk of premature (statistical) optimization. +– Build a quick-and-dirty prototype, diagnose, and fix. + +Andrew Y. Ng + + \ No newline at end of file diff --git a/Lectures/aimlcs229/cs229-cvxopt.txt b/Lectures/aimlcs229/cs229-cvxopt.txt new file mode 100644 index 0000000..c0f18ab --- /dev/null +++ b/Lectures/aimlcs229/cs229-cvxopt.txt @@ -0,0 +1,663 @@ +Convex Optimization Overview +Zico Kolter +October 19, 2007 + +1 + +Introduction + +Many situations arise in machine learning where we would like to optimize the value of +some function. That is, given a function f : Rn → R, we want to find x ∈ Rn that minimizes +(or maximizes) f (x). We have already seen several examples of optimization problems in +class: least-squares, logistic regression, and support vector machines can all be framed as +optimization problems. 
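+To make the framing concrete, here is one of those examples written out
+as an explicit optimization problem (least-squares, with a data matrix
+A ∈ R^{m×n} and target vector b ∈ R^m assumed for illustration):
+
+\min_{x \in \mathbb{R}^n} \; f(x) = \|Ax - b\|_2^2
+
+Setting the gradient 2A^T(Ax - b) to zero recovers the familiar normal
+equations A^T A x = A^T b.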
+It turns out that in the general case, finding the global optimum of a
+function can be a very difficult task. However, for a special class of
+optimization problems, known as convex optimization problems, we can
+efficiently find the global solution in many cases. Here, "efficiently"
+has both practical and theoretical connotations: it means that we can
+solve many real-world problems in a reasonable amount of time, and it
+means that theoretically we can solve problems in time that depends only
+polynomially on the problem size.
+The goal of these section notes and the accompanying lecture is to
+give a very brief overview of the field of convex optimization. Much of
+the material here (including some of the figures) is heavily based on the
+book Convex Optimization [1] by Stephen Boyd and Lieven Vandenberghe
+(available for free online), and EE364, a class taught here at Stanford
+by Stephen Boyd. If you are interested in pursuing convex optimization
+further, these are both excellent resources.
+
+2 Convex Sets
+
+We begin our look at convex optimization with the notion of a convex set.
+Definition 2.1 A set C is convex if, for any x, y ∈ C and θ ∈ R with
+0 ≤ θ ≤ 1,
+    θx + (1 − θ)y ∈ C.
+Intuitively, this means that if we take any two elements in C, and draw a
+line segment between these two elements, then every point on that line
+segment also belongs to C. Figure 1 shows an example of one convex and
+one non-convex set. The point θx + (1 − θ)y is called a convex
+combination of the points x and y.
+
+[Figure 1: Examples of a convex set (a) and a non-convex set (b).]
+
+2.1 Examples
+
+• All of R^n. It should be fairly obvious that given any x, y ∈ R^n,
+θx + (1 − θ)y ∈ R^n.
+• The non-negative orthant, R^n_+. The non-negative orthant consists of
+all vectors in R^n whose elements are all non-negative:
+R^n_+ = {x : x_i ≥ 0 ∀i = 1, . . . , n}. To show that this is a convex
+set, simply note that given any x, y ∈ R^n_+ and 0 ≤ θ ≤ 1,
+    (θx + (1 − θ)y)_i = θx_i + (1 − θ)y_i ≥ 0 ∀i.
+• Norm balls. Let ||·|| be some norm on R^n (e.g., the Euclidean norm,
+||x||_2 = (Σ_{i=1}^n x_i^2)^{1/2}). Then the set {x : ||x|| ≤ 1} is a
+convex set. To see this, suppose x, y ∈ R^n, with ||x|| ≤ 1, ||y|| ≤ 1,
+and 0 ≤ θ ≤ 1. Then
+    ||θx + (1 − θ)y|| ≤ ||θx|| + ||(1 − θ)y|| = θ||x|| + (1 − θ)||y|| ≤ 1
+where we used the triangle inequality and the positive homogeneity of
+norms.
+• Affine subspaces and polyhedra. Given a matrix A ∈ R^{m×n} and a vector
+b ∈ R^m, an affine subspace is the set {x ∈ R^n : Ax = b} (note that this
+could possibly be empty if b is not in the range of A). Similarly, a
+polyhedron is the (again, possibly empty) set {x ∈ R^n : Ax ⪯ b}, where
+'⪯' here denotes componentwise inequality (i.e., all the entries of Ax
+are less than or equal to their corresponding element in b).(1) To prove
+this, first consider x, y ∈ R^n such that Ax = Ay = b. Then for
+0 ≤ θ ≤ 1,
+    A(θx + (1 − θ)y) = θAx + (1 − θ)Ay = θb + (1 − θ)b = b.
+Similarly, for x, y ∈ R^n that satisfy Ax ⪯ b and Ay ⪯ b and 0 ≤ θ ≤ 1,
+    A(θx + (1 − θ)y) = θAx + (1 − θ)Ay ⪯ θb + (1 − θ)b = b.
+
+(1) Similarly, for two vectors x, y ∈ R^n, x ⪰ y denotes that each
+element of x is greater than or equal to the corresponding element of y.
+Note that sometimes '≤' and '≥' are used in place of '⪯' and '⪰'; the
+meaning must be determined contextually (i.e., both sides of the
+inequality will be vectors).
+
+• Intersections of convex sets. Suppose C_1, C_2, . . . , C_k are convex
+sets. Then their intersection
+    ∩_{i=1}^k C_i = {x : x ∈ C_i ∀i = 1, . . . , k}
+is also a convex set.
+To see this, consider x, y ∈ ∩_{i=1}^k C_i and 0 ≤ θ ≤ 1. Then,
+    θx + (1 − θ)y ∈ C_i ∀i = 1, . . . , k
+by the definition of a convex set. Therefore
+    θx + (1 − θ)y ∈ ∩_{i=1}^k C_i.
+Note, however, that the union of convex sets in general will not be
+convex.
+• Positive semidefinite matrices. The set of all symmetric positive
+semidefinite matrices, often called the positive semidefinite cone and
+denoted S^n_+, is a convex set (in general, S^n ⊂ R^{n×n} denotes the set
+of symmetric n × n matrices). Recall that a matrix A ∈ R^{n×n} is
+symmetric positive semidefinite if and only if A = A^T and for all
+x ∈ R^n, x^T A x ≥ 0. Now consider two symmetric positive semidefinite
+matrices A, B ∈ S^n_+ and 0 ≤ θ ≤ 1. Then for any x ∈ R^n,
+    x^T (θA + (1 − θ)B) x = θ x^T A x + (1 − θ) x^T B x ≥ 0.
+The same logic can be used to show that the sets of all positive
+definite, negative definite, and negative semidefinite matrices are each
+also convex.
+
+3 Convex Functions
+
+A central element in convex optimization is the notion of a convex
+function.
+Definition 3.1 A function f : R^n → R is convex if its domain (denoted
+D(f)) is a convex set, and if, for all x, y ∈ D(f) and θ ∈ R, 0 ≤ θ ≤ 1,
+    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y).
+Intuitively, the way to think about this definition is that if we pick
+any two points on the graph of a convex function and draw a straight line
+between them, then the portion of the function between these two points
+will lie below this straight line. This situation is pictured in
+Figure 2.(2)
+We say a function is strictly convex if Definition 3.1 holds with
+strict inequality for x ≠ y and 0 < θ < 1. We say that f is concave if
+−f is convex, and likewise that f is strictly concave if −f is strictly
+convex.
+
+(2) Don't worry too much about the requirement that the domain of f be a
+convex set. This is just a technicality to ensure that f(θx + (1 − θ)y)
+is actually defined (if D(f) were not convex, then it could be that
+f(θx + (1 − θ)y) is undefined even though x, y ∈ D(f)).
+
+[Figure 2: Graph of a convex function. By the definition of convex
+functions, the line connecting two points on the graph must lie above the
+function.]
+
+3.1 First Order Condition for Convexity
+
+Suppose a function f : R^n → R is differentiable (i.e., the gradient(3)
+∇_x f(x) exists at all points x in the domain of f). Then f is convex if
+and only if D(f) is a convex set and for all x, y ∈ D(f),
+    f(y) ≥ f(x) + ∇_x f(x)^T (y − x).
+The function f(x) + ∇_x f(x)^T (y − x) is called the first-order
+approximation to the function f at the point x. Intuitively, this can be
+thought of as approximating f with its tangent line at the point x. The
+first order condition for convexity says that f is convex if and only if
+the tangent line is a global underestimator of the function f. In other
+words, if we take our function and draw a tangent line at any point, then
+every point on this line will lie below the corresponding point on f.
+Similar to the definition of convexity, f will be strictly convex if
+this holds with strict inequality, concave if the inequality is reversed,
+and strictly concave if the reverse inequality is strict.
+
+[Figure 3: Illustration of the first-order condition for convexity.]
+
+(3) Recall that the gradient is defined as ∇_x f(x) ∈ R^n,
+(∇_x f(x))_i = ∂f(x)/∂x_i. For a review on gradients and Hessians, see
+the previous section notes on linear algebra.
+
+3.2 Second Order Condition for Convexity
+
+Suppose a function f : R^n → R is twice differentiable (i.e., the
+Hessian(4) ∇^2_x f(x) is defined for all points x in the domain of f).
+Then f is convex if and only if D(f) is a convex set and its Hessian is
+positive semidefinite: i.e., for any x ∈ D(f),
+    ∇^2_x f(x) ⪰ 0.
+Here, the notation '⪰' when used in conjunction with matrices refers to
+positive semidefiniteness, rather than componentwise inequality.(5) In
+one dimension, this is equivalent to the condition that the second
+derivative f''(x) always be nonnegative (i.e., the function never has
+negative curvature).
+Again analogous to both the definition and first order conditions for
+convexity, f is strictly convex if its Hessian is positive definite,
+concave if the Hessian is negative semidefinite, and strictly concave if
+the Hessian is negative definite.
+
+3.3 Jensen's Inequality
+
+Suppose we start with the inequality in the basic definition of a convex
+function
+    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) for 0 ≤ θ ≤ 1.
+Using induction, this can be fairly easily extended to convex
+combinations of more than one point,
+    f(Σ_{i=1}^k θ_i x_i) ≤ Σ_{i=1}^k θ_i f(x_i)
+        for Σ_{i=1}^k θ_i = 1, θ_i ≥ 0 ∀i.
+In fact, this can also be extended to infinite sums or integrals. In the
+latter case, the inequality can be written as
+    f(∫ p(x) x dx) ≤ ∫ p(x) f(x) dx  for ∫ p(x) dx = 1, p(x) ≥ 0 ∀x.
+Because p(x) integrates to 1, it is common to consider it as a
+probability density, in which case the previous equation can be written
+in terms of expectations,
+    f(E[x]) ≤ E[f(x)].
+This last inequality is known as Jensen's inequality, and it will come up
+later in class.(6)
+
+(4) Recall the Hessian is defined as ∇^2_x f(x) ∈ R^{n×n},
+(∇^2_x f(x))_{ij} = ∂^2 f(x) / (∂x_i ∂x_j).
+(5) Similarly, for a symmetric matrix X ∈ S^n, X ⪯ 0 denotes that X is
+negative semidefinite. As with vector inequalities, '≤' and '≥' are
+sometimes used in place of '⪯' and '⪰'. Despite their notational
+similarity to vector inequalities, these concepts are very different; in
+particular, X ⪰ 0 does not imply that X_{ij} ≥ 0 for all i and j.
+(6) In fact, all four of these equations are sometimes referred to as
+Jensen's inequality, due to the fact that they are all equivalent.
+However, for this class we will use the term to refer specifically to the
+last inequality presented here.
+
+3.4 Sublevel Sets
+
+Convex functions give rise to a particularly important type of convex set
+called an α-sublevel set. Given a convex function f : R^n → R and a real
+number α ∈ R, the α-sublevel set is defined as
+    {x ∈ D(f) : f(x) ≤ α}.
+In other words, the α-sublevel set is the set of all points x such that
+f(x) ≤ α.
+To show that this is a convex set, consider any x, y ∈ D(f) such that
+f(x) ≤ α and f(y) ≤ α. Then
+    f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y) ≤ θα + (1 − θ)α = α.
+
+3.5 Examples
+
+We begin with a few simple examples of convex functions of one variable,
+then move on to multivariate functions.
+• Exponential. Let f : R → R, f(x) = e^{ax} for any a ∈ R. To show f is
+convex, we can simply take the second derivative f''(x) = a^2 e^{ax},
+which is positive for all x.
+• Negative logarithm. Let f : R → R, f(x) = −log x with domain
+D(f) = R_{++} (here, R_{++} denotes the set of strictly positive real
+numbers, {x : x > 0}). Then f''(x) = 1/x^2 > 0 for all x.
+• Affine functions. Let f : R^n → R, f(x) = b^T x + c for some b ∈ R^n,
+c ∈ R.
+3.5   Examples
+
+We begin with a few simple examples of convex functions of one variable, then move on to
+multivariate functions.
+
+• Exponential. Let f : R → R, f(x) = e^{ax} for any a ∈ R. To show f is convex, we can
+simply take the second derivative f″(x) = a² e^{ax}, which is positive for all x.
+
+• Negative logarithm. Let f : R → R, f(x) = − log x with domain D(f) = R_{++}
+(here, R_{++} denotes the set of strictly positive real numbers, {x : x > 0}). Then
+f″(x) = 1/x² > 0 for all x.
+
+• Affine functions. Let f : R^n → R, f(x) = b^T x + c for some b ∈ R^n, c ∈ R. In
+this case the Hessian, ∇²_x f(x) = 0 for all x. Because the zero matrix is both positive
+semidefinite and negative semidefinite, f is both convex and concave. In fact, affine
+functions of this form are the only functions that are both convex and concave.
+
+• Quadratic functions. Let f : R^n → R, f(x) = (1/2) x^T Ax + b^T x + c for a symmetric
+matrix A ∈ S^n, b ∈ R^n and c ∈ R. In our previous section notes on linear algebra, we
+showed the Hessian for this function is given by
+
+        ∇²_x f(x) = A.
+
+Therefore, the convexity or non-convexity of f is determined entirely by whether or
+not A is positive semidefinite: if A is positive semidefinite then the function is convex
+(and analogously for strictly convex, concave, strictly concave). If A is indefinite then
+f is neither convex nor concave. (A short numerical sketch of this eigenvalue test
+follows this list.)
+Note that the squared Euclidean norm f(x) = ||x||₂² = x^T x is a special case of quadratic
+functions where A = I, b = 0, c = 0, so it is therefore a strictly convex function.
+
+• Norms. Let f : R^n → R be some norm on R^n. Then by the triangle inequality and
+positive homogeneity of norms, for x, y ∈ R^n, 0 ≤ θ ≤ 1,
+
+        f(θx + (1 − θ)y) ≤ f(θx) + f((1 − θ)y) = θf(x) + (1 − θ)f(y).
+
+This is an example of a convex function where it is not possible to prove convexity based
+on the second or first order conditions, because norms are not generally differentiable
+everywhere (e.g., the 1-norm, ||x||₁ = Σ_{i=1}^n |x_i|, is non-differentiable at all points where
+any x_i is equal to zero).
+
+• Nonnegative weighted sums of convex functions. Let f₁, f₂, . . . , f_k be convex
+functions and w₁, w₂, . . . , w_k be nonnegative real numbers. Then
+
+        f(x) = Σ_{i=1}^k w_i f_i(x)
+
+is a convex function, since
+
+        f(θx + (1 − θ)y) = Σ_{i=1}^k w_i f_i(θx + (1 − θ)y)
+                         ≤ Σ_{i=1}^k w_i (θ f_i(x) + (1 − θ) f_i(y))
+                         = θ Σ_{i=1}^k w_i f_i(x) + (1 − θ) Σ_{i=1}^k w_i f_i(y)
+                         = θ f(x) + (1 − θ) f(y).
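+Following up on the quadratic functions example above, here is a small numerical sketch
+(our own illustration, with an arbitrary example matrix) that classifies a quadratic by the
+definiteness of its Hessian A via its eigenvalues:
+
+import numpy as np
+
+# Classify f(x) = (1/2) x^T A x + b^T x + c by the definiteness of A.
+def classify_quadratic(A, tol=1e-10):
+    eigs = np.linalg.eigvalsh(A)        # eigenvalues of the symmetric Hessian
+    if np.all(eigs > tol):    return "strictly convex"
+    if np.all(eigs >= -tol):  return "convex"
+    if np.all(eigs < -tol):   return "strictly concave"
+    if np.all(eigs <= tol):   return "concave"
+    return "indefinite: neither convex nor concave"
+
+A = np.array([[2.0, 0.5],
+              [0.5, 1.0]])
+print(classify_quadratic(A))   # both eigenvalues positive -> strictly convex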
+4   Convex Optimization Problems
+
+Armed with the definitions of convex functions and sets, we are now equipped to consider
+convex optimization problems. Formally, a convex optimization problem is an optimiza-
+tion problem of the form
+
+        minimize   f(x)
+        subject to x ∈ C
+
+where f is a convex function, C is a convex set, and x is the optimization variable. However,
+since this can be a little bit vague, we often write it as
+
+        minimize   f(x)
+        subject to g_i(x) ≤ 0,   i = 1, . . . , m
+                   h_i(x) = 0,   i = 1, . . . , p
+
+where f is a convex function, the g_i are convex functions, the h_i are affine functions, and x
+is the optimization variable.
+It is important to note the direction of these inequalities: a convex function g_i must be
+less than or equal to zero. This is because the 0-sublevel set of g_i is a convex set, so the
+feasible region, which is the intersection of many convex sets, is also convex (recall that affine
+subspaces are convex sets as well). If we were to require that g_i ≥ 0 for some convex g_i, the
+feasible region would no longer be a convex set, and the algorithms we apply for solving these
+problems would no longer be guaranteed to find the global optimum. Also notice that only
+affine functions are allowed to be equality constraints. Intuitively, you can think of this as
+being due to the fact that an equality constraint is equivalent to the two inequalities h_i ≤ 0
+and h_i ≥ 0. However, these will both be valid constraints if and only if h_i is both convex
+and concave, i.e., h_i must be affine.
+The optimal value of an optimization problem is denoted p⋆ (or sometimes f⋆) and is
+equal to the minimum possible value of the objective function in the feasible region^7
+
+        p⋆ = min{ f(x) : g_i(x) ≤ 0, i = 1, . . . , m, h_i(x) = 0, i = 1, . . . , p }.
+
+We allow p⋆ to take on the values +∞ and −∞ when the problem is either infeasible (the
+feasible region is empty) or unbounded below (there exist feasible points such that f(x) →
+−∞), respectively. We say that x⋆ is an optimal point if f(x⋆) = p⋆. Note that there can
+be more than one optimal point, even when the optimal value is finite.
+
+^7 Math majors might note that the min appearing in the definition above should more correctly be an inf.
+We won't worry about such technicalities here, and use min for simplicity.
+
+4.1   Global Optimality in Convex Problems
+
+Before stating the result of global optimality in convex problems, let us formally define
+the concepts of local optima and global optima. Intuitively, a feasible point is called locally
+optimal if there are no "nearby" feasible points that have a lower objective value. Similarly,
+a feasible point is called globally optimal if there are no feasible points at all that have a
+lower objective value. To formalize this a little bit more, we give the following two definitions.
+
+Definition 4.1 A point x is locally optimal if it is feasible (i.e., it satisfies the constraints
+of the optimization problem) and if there exists some R > 0 such that all feasible points z
+with ||x − z||₂ ≤ R satisfy f(x) ≤ f(z).
+
+Definition 4.2 A point x is globally optimal if it is feasible and for all feasible points z,
+f(x) ≤ f(z).
+
+We now come to the crucial element of convex optimization problems, from which they
+derive most of their utility. The key idea is that for a convex optimization problem
+
+        all locally optimal points are globally optimal.
+
+Let's give a quick proof of this property by contradiction. Suppose that x is a locally
+optimal point which is not globally optimal, i.e., there exists a feasible point y such that
+f(x) > f(y). By the definition of local optimality, there exists some R > 0 such that no
+feasible point z with ||x − z||₂ ≤ R has f(z) < f(x). But now suppose we choose the point
+
+        z = θy + (1 − θ)x    with θ = R / (2||x − y||₂).
+
+Then
+
+        ||x − z||₂ = || x − (R/(2||x − y||₂)) y − (1 − R/(2||x − y||₂)) x ||₂
+                   = || (R/(2||x − y||₂)) (x − y) ||₂
+                   = R/2 ≤ R.
+
+In addition, by the convexity of f we have
+
+        f(z) = f(θy + (1 − θ)x) ≤ θf(y) + (1 − θ)f(x) < f(x).
+
+Furthermore, since the feasible set is a convex set, and since x and y are both feasible,
+z = θy + (1 − θ)x will be feasible as well. Therefore, z is a feasible point with ||x − z||₂ ≤ R
+and f(z) < f(x). This contradicts our assumption, showing that x cannot be locally optimal.
+
+4.2   Special Cases of Convex Problems
+
+For a variety of reasons, it is often times convenient to consider special cases of the general
+convex programming formulation. For these special cases we can often devise extremely
+efficient algorithms that can solve very large problems, and because of this you will probably
+see these special cases referred to any time people use convex optimization techniques.
+• Linear Programming. We say that a convex optimization problem is a linear
+program (LP) if both the objective function f and inequality constraints g_i are affine
+functions. In other words, these problems have the form
+
+        minimize   c^T x + d
+        subject to Gx ≼ h
+                   Ax = b
+
+where x ∈ R^n is the optimization variable, c ∈ R^n, d ∈ R, G ∈ R^{m×n}, h ∈ R^m,
+A ∈ R^{p×n}, b ∈ R^p are defined by the problem, and '≼' denotes elementwise inequality.
+
+• Quadratic Programming. We say that a convex optimization problem is a quadratic
+program (QP) if the inequality constraints g_i are still all affine, but if the objective
+function f is a convex quadratic function. In other words, these problems have the
+form,
+
+        minimize   (1/2) x^T Px + c^T x + d
+        subject to Gx ≼ h
+                   Ax = b
+
+where again x ∈ R^n is the optimization variable, c ∈ R^n, d ∈ R, G ∈ R^{m×n}, h ∈ R^m,
+A ∈ R^{p×n}, b ∈ R^p are defined by the problem, but we also have P ∈ S^n_+, a symmetric
+positive semidefinite matrix.
+
+• Quadratically Constrained Quadratic Programming. We say that a convex
+optimization problem is a quadratically constrained quadratic program (QCQP)
+if both the objective f and the inequality constraints g_i are convex quadratic functions,
+
+        minimize   (1/2) x^T Px + c^T x + d
+        subject to (1/2) x^T Q_i x + r_i^T x + s_i ≤ 0,   i = 1, . . . , m
+                   Ax = b
+
+where, as before, x ∈ R^n is the optimization variable, c ∈ R^n, d ∈ R, A ∈ R^{p×n}, b ∈ R^p,
+P ∈ S^n_+, but we also have Q_i ∈ S^n_+, r_i ∈ R^n, s_i ∈ R, for i = 1, . . . , m.
+
+• Semidefinite Programming. This last example is a bit more complex than the pre-
+vious ones, so don't worry if it doesn't make much sense at first. However, semidefinite
+programming is becoming more and more prevalent in many different areas of machine
+learning research, so you might encounter these at some point, and it is good to have an
+idea of what they are. We say that a convex optimization problem is a semidefinite
+program (SDP) if it is of the form
+
+        minimize   tr(CX)
+        subject to tr(A_i X) = b_i,   i = 1, . . . , p
+                   X ⪰ 0
+
+where the symmetric matrix X ∈ S^n is the optimization variable, the symmetric ma-
+trices C, A₁, . . . , A_p ∈ S^n are defined by the problem, and the constraint X ⪰ 0 means
+that we are constraining X to be positive semidefinite. This looks a bit different than
+the problems we have seen previously, since the optimization variable is now a matrix
+instead of a vector. If you are curious as to why such a formulation might be useful,
+you should look into a more advanced course or book on convex optimization.
+
+It should be fairly obvious from the definitions that quadratic programs are more general
+than linear programs (since a linear program is just a special case of a quadratic program
+where P = 0), and likewise that quadratically constrained quadratic programs are more
+general than quadratic programs. However, what is not obvious at all is that semidefinite
+programs are in fact more general than all the previous types. That is, any quadratically
+constrained quadratic program (and hence any quadratic program or linear program) can
+be expressed as a semidefinite program. We won't discuss this relationship further in this
+document, but this might give you just a small idea as to why semidefinite programming
+could be useful.
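+Before moving on to examples, here is a minimal numerical sketch of the LP standard form
+above, using the CVXOPT Python package (our choice for these illustrative snippets; the
+toy data below are made up for the example, not taken from class):
+
+from cvxopt import matrix, solvers
+
+# LP in the form above: minimize c^T x subject to Gx <= h (elementwise).
+# Toy data: minimize -x1 - x2 with x1 + 2*x2 <= 3, 2*x1 + x2 <= 3,
+# x1 >= 0, x2 >= 0; the optimum is x = (1, 1).
+c = matrix([-1.0, -1.0])
+G = matrix([[1.0, 2.0, -1.0, 0.0],    # CVXOPT matrices are column-major:
+            [2.0, 1.0, 0.0, -1.0]])   # each inner list is one column of G
+h = matrix([3.0, 3.0, 0.0, 0.0])
+
+solvers.options['show_progress'] = False
+sol = solvers.lp(c, G, h)
+print(sol['x'])                       # approximately (1.0, 1.0)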
+4.3   Examples
+
+Now that we've covered plenty of the boring math and formalisms behind convex optimiza-
+tion, we can finally get to the fun part: using these techniques to solve actual problems.
+We've already encountered a few such optimization problems in class, and in nearly every
+field, there is a good chance that someone has tried to apply convex optimization to solve
+some problem.
+
+• Support Vector Machines. One of the most prevalent applications of convex opti-
+mization methods in machine learning is the support vector machine classifier. As
+discussed in class, finding the support vector classifier (in the case with slack variables)
+can be formulated as the optimization problem
+
+        minimize   (1/2)||w||₂² + C Σ_{i=1}^m ξ_i
+        subject to y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i,   i = 1, . . . , m
+                   ξ_i ≥ 0,                          i = 1, . . . , m
+
+with optimization variables w ∈ R^n, ξ ∈ R^m, b ∈ R, and where C ∈ R and x^(i), y^(i),
+i = 1, . . . , m are defined by the problem. This is an example of a quadratic program,
+which we can see by putting the problem into the form described in the previous
+section. In particular, if we define k = m + n + 1, let the optimization variable be the
+stacked vector
+
+        x ∈ R^k ≡ [ w ]
+                  [ ξ ]
+                  [ b ]
+
+and define the matrices
+
+        P = [ I  0  0 ]                    c = [  0  ]
+            [ 0  0  0 ]  ∈ R^{k×k},            [ C·1 ]  ∈ R^k,
+            [ 0  0  0 ]                        [  0  ]
+
+        G = [ −diag(y)X  −I  −y ]          h = [ −1 ]
+            [     0      −I   0 ]  ∈ R^{2m×k},  [  0 ]  ∈ R^{2m},
+
+where I is the identity, 1 is the vector of all ones, and X and y are defined as in class,
+
+        X = [ — (x^(1))^T — ]
+            [       ⋮       ]  ∈ R^{m×n},   y = (y^(1), . . . , y^(m))^T ∈ R^m.
+            [ — (x^(m))^T — ]
+
+You should try to convince yourself that the quadratic program described in the pre-
+vious section, when using these matrices defined above, is equivalent to the SVM
+optimization problem. In reality, it is fairly easy to see that the SVM optimization
+problem has a quadratic objective and linear constraints, so we typically don't need
+to put it into standard form to "prove" that it is a QP, and would only do so if we are
+using an off-the-shelf solver that requires the input to be in standard form.
+
+• Constrained least squares. In class we have also considered the least squares prob-
+lem, where we want to minimize ||Ax − b||₂² for some matrix A ∈ R^{m×n} and b ∈ R^m.
+As we saw, this particular problem can actually be solved analytically via the normal
+equations. However, suppose that we also want to constrain the entries in the solution
+x to lie within some predefined ranges. In other words, suppose we wanted to solve
+the optimization problem,
+
+        minimize   (1/2)||Ax − b||₂²
+        subject to l ≼ x ≼ u
+
+with optimization variable x and problem data A ∈ R^{m×n}, b ∈ R^m, l ∈ R^n, and u ∈ R^n.
+This might seem like a fairly simple additional constraint, but it turns out that there
+will no longer be an analytical solution. However, you should be able to convince
+yourself that this optimization problem is a quadratic program, with matrices defined
+by
+
+        P = A^T A ∈ R^{n×n},   c = −A^T b ∈ R^n,   d = (1/2) b^T b ∈ R,
+
+        G = [ −I ]  ∈ R^{2n×n},   h = [ −l ]  ∈ R^{2n}.
+            [  I ]                    [  u ]
+
+(A small numerical sketch of this mapping appears after this list.)
+
+• Maximum Likelihood for Logistic Regression. For homework one, you were
+required to show that the log-likelihood of the data in a logistic model was concave.
+The log-likelihood under such a model is
+
+        ℓ(θ) = Σ_{i=1}^n [ y^(i) ln g(θ^T x^(i)) + (1 − y^(i)) ln(1 − g(θ^T x^(i))) ]
+
+where g(z) denotes the logistic function g(z) = 1/(1 + e^{−z}). Finding the maximum
+likelihood estimate is then a task of maximizing the log-likelihood (or equivalently,
+minimizing the negative log-likelihood, a convex function), i.e.,
+
+        minimize   −ℓ(θ)
+
+with optimization variable θ ∈ R^n and no constraints.
+Unlike the previous two examples, it turns out that it is not so easy to put this prob-
+lem into a "standard" form optimization problem. Nevertheless, you've seen on the
+homework that the fact that ℓ is a concave function means that you can very efficiently
+find the global solution using an algorithm such as Newton's method.
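+Returning to the constrained least squares example above, here is a minimal sketch of the
+QP mapping using CVXOPT's qp solver (the data are random placeholders of our own):
+
+import numpy as np
+from cvxopt import matrix, solvers
+
+# Hypothetical data: random A, b; box constraints l = 0, u = 0.5.
+rng = np.random.default_rng(0)
+A = rng.normal(size=(20, 5)); b = rng.normal(size=20)
+l, u = np.zeros(5), 0.5 * np.ones(5)
+
+# QP data per the mapping in the text: P = A^T A, q = -A^T b,
+# G = [-I; I], h = [-l; u], so that Gx <= h encodes l <= x <= u.
+P = matrix(A.T @ A)
+q = matrix(-A.T @ b)
+G = matrix(np.vstack([-np.eye(5), np.eye(5)]))
+h = matrix(np.concatenate([-l, u]))
+
+solvers.options['show_progress'] = False
+sol = solvers.qp(P, q, G, h)
+x = np.array(sol['x']).ravel()
+print(x)                        # each coordinate stays within [0, 0.5]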
+
+References
+
+[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge UP, 2004.
+    Online: http://www.stanford.edu/∼boyd/cvxbook/
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-cvxopt2.txt b/Lectures/aimlcs229/cs229-cvxopt2.txt
new file mode 100644
index 0000000..b89e61a
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-cvxopt2.txt
@@ -0,0 +1,1016 @@
+Convex Optimization Overview (cnt'd)
+Chuong B. Do
+October 26, 2007
+
+1   Recap
+
+During last week's section, we began our study of convex optimization, the study of
+mathematical optimization problems of the form,
+
+        minimize_{x ∈ R^n}  f(x)
+        subject to  g_i(x) ≤ 0,  i = 1, . . . , m,                      (1)
+                    h_i(x) = 0,  i = 1, . . . , p,
+
+where x ∈ R^n is the optimization variable, f : R^n → R and g_i : R^n → R are convex functions,
+and h_i : R^n → R are affine functions. In a convex optimization problem, the convexity of both
+the objective function f and the feasible region (i.e., the set of x's satisfying all constraints)
+allows us to conclude that any feasible locally optimal point must also be globally optimal.
+This fact provides the key intuition for why convex optimization problems can in general be
+solved efficiently.
+In these lecture notes, we continue our foray into the field of convex optimization. In
+particular, we will introduce the theory of Lagrange duality for convex optimization problems
+with inequality and equality constraints. We will also discuss generic yet efficient algorithms
+for solving convex optimization problems, and then briefly mention directions for further
+exploration.
+
+2   Duality
+
+To explain the fundamental ideas behind duality theory, we start with a motivating example
+based on CS 229 homework grading. We prove a simple weak duality result in this setting,
+and then relate it to duality in optimization. We then discuss strong duality and the KKT
+optimality conditions.
+
+2.1   A motivating example: CS 229 homework grading
+
+In CS 229, students must complete four homeworks throughout the quarter, each consisting
+of five questions apiece. Suppose that during one year that the course is offered, the TAs
+decide to economize on their workload for the quarter by grading only one problem on
+each submitted problem set. Nevertheless, they also require that every student submit
+an attempted solution to every problem (a requirement which, if violated, would lead to
+automatic failure of the course).
+Because they are extremely cold-hearted^1, the TAs always try to ensure that the students
+lose as many points as possible; if the TAs grade a problem that the student did not attempt,
+the number of points lost is set to +∞ to denote automatic failure in the course. Conversely,
+each student in the course seeks to minimize the number of points lost on his or her assign-
+ments, and thus must decide on a strategy (i.e., an allocation of time to problems) that
+minimizes the number of points lost on the assignment.
+The struggle between student and TAs can be summarized in a matrix A = (a_ij) ∈ R^{n×m},
+whose columns correspond to different problems that the TAs might grade, and whose rows
+correspond to different strategies for time allocation that the student might use for the
+problem set.
+
+^1 Clearly, this is a fictional example. The CS 229 TAs want you to succeed. Really, we do.
+For example, consider the following matrix,
+
+            [  5    5    5   5   5  ]
+        A = [  8    8    1   8   8  ] .
+            [ +∞   +∞   +∞   0  +∞  ]
+
+Here, the student must decide between three strategies (corresponding to the three rows of
+the matrix, A):
+
+• i = 1: she invests an equal effort into all five problems and hence loses at most 5 points
+  on each problem,
+• i = 2: she invests more time into problem 3 than into the other four problems, and
+• i = 3: she skips four problems in order to guarantee no points lost on problem 4.
+
+Similarly, the TAs must decide between five strategies (j ∈ {1, 2, 3, 4, 5}) corresponding to
+the choice of problem graded.
+If the student is forced to submit the homework without knowing the TAs' choice of
+problem to be graded, and if the TAs are allowed to decide which problem to grade after
+having seen the student's problem set, then the number of points she loses will be:
+
+        p∗ = min_i max_j a_ij        (= 5 in the example above)         (P)
+
+where the order of the minimization and maximization reflects that for each fixed student
+time allocation strategy i, the TAs will have the opportunity to choose the worst-scoring
+problem max_j a_ij to grade. However, if the TAs announce beforehand which homework
+problem will be graded, then the number of points lost will be:
+
+        d∗ = max_j min_i a_ij        (= 0 in the example above)         (D)
+
+where this time, for each possible announced homework problem j to be graded, the student
+will have the opportunity to choose the optimal time allocation strategy, min_i a_ij, which
+loses her the fewest points. Here, (P) is called the primal optimization problem whereas (D)
+is called the dual optimization problem. Rows containing +∞ values correspond to strategies
+where the student has flagrantly violated the TAs' demand that all problems be attempted;
+for reasons which will become clear later, we refer to these rows as being primal-infeasible.
+In the example, the value of the dual problem is lower than that of the primal problem,
+i.e., d∗ = 0 < 5 = p∗. This intuitively makes sense: the second player in this adversarial
+game has the advantage of knowing his/her opponent's strategy. This principle, however,
+holds more generally:
+
+Theorem 2.1 (Weak duality). For any matrix A = (a_ij) ∈ R^{n×m}, it is always the case that
+
+        max_j min_i a_ij = d∗ ≤ p∗ = min_i max_j a_ij.
+
+Proof. Let (i_d, j_d) be the row and column associated with d∗, and let (i_p, j_p) be the row and
+column associated with p∗. We have,
+
+        d∗ = a_{i_d j_d} ≤ a_{i_p j_d} ≤ a_{i_p j_p} = p∗.
+
+Here, the first inequality follows from the fact that a_{i_d j_d} is the smallest element in the j_d-th
+column (i.e., i_d was the strategy chosen by the student after the TAs chose problem j_d, and
+hence, it must correspond to the fewest points lost in that column). Similarly, the second
+inequality follows from the fact that a_{i_p j_p} is the largest element in the i_p-th row (i.e., j_p was
+the problem chosen by the TAs after the student picked strategy i_p, so it must correspond
+to the most points lost in that row).
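+To see Theorem 2.1 in action numerically, here is a tiny Python/NumPy sketch (on a small
+payoff matrix of our own choosing, not the grading matrix above):
+
+import numpy as np
+
+# Weak duality on a 2x2 payoff matrix: the inner (second-moving) player
+# always has the advantage of seeing the outer player's committed strategy.
+A = np.array([[1.0, -1.0],
+              [-1.0, 1.0]])
+p_star = A.max(axis=1).min()   # row player commits first:    min_i max_j a_ij
+d_star = A.min(axis=0).max()   # column player commits first: max_j min_i a_ij
+print(d_star, p_star)          # -1.0 1.0, so d* <= p* (here strictly)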
+2.2   Duality in optimization
+
+The task of constrained optimization, it turns out, relates closely to the adversarial game
+described in the previous section. To see the connection, first recall our original optimization
+problem,
+
+        minimize_x  f(x)
+        subject to  g_i(x) ≤ 0,  i = 1, . . . , m,
+                    h_i(x) = 0,  i = 1, . . . , p.
+
+Define the generalized Lagrangian to be
+
+        L(x, λ, ν) := f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{i=1}^p ν_i h_i(x).
+
+Here, the variables λ and ν are called the dual variables (or Lagrange multipliers).
+Analogously, the variables x are known as the primal variables.
+The correspondence between primal/dual optimization and game playing can be pictured
+informally using an infinite matrix whose rows are indexed by x ∈ R^n and whose columns
+are indexed by (λ, ν) ∈ R^m_+ × R^p (i.e., λ_i ≥ 0, for i = 1, . . . , m). In particular, we have
+
+        A = [ · · ·  L(x, λ, ν)  · · · ].
+
+Here, the "student" manipulates the primal variables x in order to minimize the Lagrangian
+L(x, λ, ν) while the "TAs" manipulate the dual variables (λ, ν) in order to maximize the
+Lagrangian.
+To see the relationship between this game and the original optimization problem, we
+formulate the following primal problem:
+
+        p∗ = min_x max_{λ,ν : λ_i ≥ 0} L(x, λ, ν)
+           = min_x θ_P(x)                                               (P')
+
+where θ_P(x) := max_{λ,ν : λ_i ≥ 0} L(x, λ, ν). Computing p∗ is equivalent to our original convex
+optimization problem in the following sense: for any candidate solution x,
+
+• if g_i(x) > 0 for some i ∈ {1, . . . , m}, then setting λ_i = ∞ gives θ_P(x) = ∞;
+• if h_i(x) ≠ 0 for some i ∈ {1, . . . , p}, then setting ν_i = ∞ · Sign(h_i(x)) gives θ_P(x) = ∞;
+• if x is feasible (i.e., x obeys all the constraints of our original optimization problem),
+  then θ_P(x) = f(x), where the maximum is obtained, for example, by setting all of the
+  λ_i's and ν_i's to zero.
+
+Intuitively then, θ_P(x) behaves conceptually like an "unconstrained" version of the original
+constrained optimization problem in which the infeasible region of f is "carved away" by
+forcing θ_P(x) = ∞ for any infeasible x; thus, only points in the feasible region are left
+as candidate minimizers. This idea of using penalties to ensure that minimizers stay in the
+feasible region will come up later when we talk about barrier algorithms for convex optimiza-
+tion.
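+To make the "carving away" intuition concrete, here is a small numerical sketch in Python
+(a toy problem of our own, not one from class):
+
+import numpy as np
+
+# Toy problem: minimize x^2 subject to x >= 1, i.e. g(x) = 1 - x <= 0.
+# The Lagrangian is L(x, lam) = x^2 + lam * (1 - x), and
+# theta_P(x) = sup_{lam >= 0} L(x, lam) equals x^2 when x is feasible
+# and blows up toward +infinity when x is infeasible.
+lams = np.linspace(0.0, 1e6, 1001)     # crude stand-in for lam -> infinity
+for x in [0.5, 1.0, 2.0]:
+    theta_P = np.max(x**2 + lams * (1.0 - x))
+    print(x, theta_P)   # ~5e5 (diverging) for x = 0.5; x^2 for x = 1.0, 2.0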
+By analogy to the CS 229 grading example, we can form the following dual problem:
+
+        d∗ = max_{λ,ν : λ_i ≥ 0} min_x L(x, λ, ν)
+           = max_{λ,ν : λ_i ≥ 0} θ_D(λ, ν)                              (D')
+
+where θ_D(λ, ν) := min_x L(x, λ, ν). Dual problems can often be easier to solve than their
+corresponding primal problems. In the case of SVMs, for instance, SMO is a dual optimiza-
+tion algorithm which considers joint optimization of pairs of dual variables. Its simple form
+derives largely from the simplicity of the dual objective and the simplicity of the correspond-
+ing constraints on the dual variables. Primal-based SVM solutions are indeed possible, but
+when the number of training examples is large and the kernel matrix K of inner products
+K_ij = K(x^(i), x^(j)) is large, dual-based optimization can be considerably more efficient.
+Using an argument essentially identical to that presented in Theorem (2.1), we can show
+that in this setting, we again have d∗ ≤ p∗. This is the property of weak duality for
+general optimization problems. Weak duality can be particularly useful in the design of
+optimization algorithms. For example, suppose that during the course of an optimization
+algorithm we have a candidate primal solution x and dual-feasible vector (λ, ν) such that
+θ_P(x) − θ_D(λ, ν) ≤ ε. From weak duality, we have that
+
+        θ_D(λ, ν) ≤ d∗ ≤ p∗ ≤ θ_P(x),
+
+implying that x and (λ, ν) must be ε-optimal (i.e., their objective functions differ by no more
+than ε from the objective functions of the true optima x∗ and (λ∗, ν∗)).
+In practice, the dual objective θ_D(λ, ν) can often be found in closed form, thus allowing
+the dual problem (D') to depend only on the dual variables λ and ν. When the Lagrangian is
+differentiable with respect to x, then a closed form for θ_D(λ, ν) can often be found by setting
+the gradient of the Lagrangian to zero, so as to ensure that the Lagrangian is minimized
+with respect to x.^2 An example derivation of the dual problem for the L1 soft-margin SVM
+is shown in the Appendix.
+
+^2 Often, differentiating the Lagrangian with respect to x leads to the generation of additional requirements
+on dual variables that must hold at any fixed point of the Lagrangian with respect to x. When these
+constraints are not satisfied, one can show that the Lagrangian is unbounded below (i.e., θ_D(λ, ν) = −∞).
+Since such points are clearly not optimal solutions for the dual problem, we can simply exclude them from
+the domain of the dual problem altogether by adding the derived constraints to the existing constraints of
+the dual problem. An example of this is the derived constraint, Σ_{i=1}^m α_i y^(i) = 0, in the SVM formulation.
+This procedure of incorporating derived constraints into the dual problem is known as making dual
+constraints explicit (see [1], page 224).
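+As a tiny worked illustration of this closed-form recipe (our own example, continuing the
+toy problem from the sketch above): for minimizing x² subject to g(x) = 1 − x ≤ 0, the
+Lagrangian is L(x, λ) = x² + λ(1 − x). Setting its derivative with respect to x to zero gives
+x = λ/2, so
+
+        θ_D(λ) = min_x L(x, λ) = λ − λ²/4.
+
+Maximizing θ_D over λ ≥ 0 gives λ∗ = 2 and d∗ = 1, which equals the primal optimal value
+p∗ = 1 attained at x∗ = 1. The fact that d∗ = p∗ here is an instance of strong duality, the
+subject of the next section (note, for example, that the point x = 2 satisfies the inequality
+constraint strictly).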
+2.3   Strong duality
+
+For any primal/dual optimization problems, weak duality will always hold. In some cases,
+however, the inequality d∗ ≤ p∗ may be replaced with equality, i.e., d∗ = p∗; this latter
+condition is known as strong duality. Strong duality does not hold in general. When it
+does hold, however, the lower-bound property described in the previous section provides a
+useful termination criterion for optimization algorithms. In particular, we can design algo-
+rithms which simultaneously optimize both the primal and dual problems. Once the candi-
+date solutions x of the primal problem and (λ, ν) of the dual problem obey θ_P(x) − θ_D(λ, ν) ≤ ε,
+then we know that both solutions are ε-accurate. This is guaranteed to happen provided
+our optimization algorithm works properly, since strong duality guarantees that the optimal
+primal and dual values are equal.
+Conditions which guarantee strong duality for convex optimization problems are known
+as constraint qualifications. The most commonly invoked constraint qualification, for
+example, is Slater's condition:
+
+Theorem 2.2. Consider a convex optimization problem of the form (1), whose corresponding
+primal and dual problems are given by (P') and (D'). If there exists a primal feasible x for
+which each inequality constraint is strictly satisfied (i.e., g_i(x) < 0), then d∗ = p∗.^3
+
+The proof of this theorem is beyond the scope of this course. We will, however, point out
+its application to the soft-margin SVMs described in class. Recall that soft-margin SVMs
+were found by solving
+
+        minimize_{w,b,ξ}  (1/2)||w||² + C Σ_{i=1}^m ξ_i
+        subject to  y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i,  i = 1, . . . , m,
+                    ξ_i ≥ 0,                         i = 1, . . . , m.
+
+Slater's condition applies provided we can find at least one primal feasible setting of w, b,
+and ξ where all inequalities are strict. It is easy to verify that w = 0, b = 0, ξ = 2 · 1 satisfies
+these conditions (where 0 and 1 denote the vectors of all 0's and all 1's, respectively), since
+
+        y^(i)(w^T x^(i) + b) = y^(i)(0^T x^(i) + 0) = 0 > −1 = 1 − 2 = 1 − ξ_i,   i = 1, . . . , m,
+
+and the remaining m inequalities are trivially strictly satisfied. Hence, strong duality holds,
+so the optimal values of the primal and dual soft-margin SVM problems will be equal.
+
+^3 One can actually show a more general version of Slater's condition, which requires only strict satisfaction
+of non-affine inequality constraints (but allows affine inequalities to be satisfied with equality). See [1],
+page 226.
+
+2.4   The KKT conditions
+
+In the case of differentiable unconstrained convex optimization problems, setting the gradient
+to "zero" provides a simple means for identifying candidate local optima. For constrained
+convex programming, do similar criteria exist for characterizing the optima of primal/dual
+optimization problems? The answer, it turns out, is provided by a set of requirements known
+as the Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions (see [1],
+pages 242-244).
+Suppose that the constraint functions g₁, . . . , g_m, h₁, . . . , h_p are not only convex (the h_i's
+must be affine) but also differentiable.
+
+Theorem 2.3. If x̃ is primal feasible and (λ̃, ν̃) are dual feasible, and if
+
+        ∇_x L(x̃, λ̃, ν̃) = 0,                                           (KKT1)
+        λ̃_i g_i(x̃) = 0,    i = 1, . . . , m,                           (KKT2)
+
+then x̃ is primal optimal, (λ̃, ν̃) are dual optimal, and strong duality holds.
+
+Theorem 2.4. If Slater's condition holds, then the conditions of Theorem 2.3 are necessary
+for any (x∗, λ∗, ν∗) such that x∗ is primal optimal and (λ∗, ν∗) are dual optimal.
+
+(KKT1) is the standard gradient stationarity condition found for unconstrained differentiable
+optimization problems. The set of equalities corresponding to (KKT2) are known as the
+KKT complementarity (or complementary slackness) conditions. In particular,
+if x∗ is primal optimal and (λ∗, ν∗) is dual optimal, then (KKT2) implies that
+
+        λ∗_i > 0   ⇒  g_i(x∗) = 0
+        g_i(x∗) < 0  ⇒  λ∗_i = 0.
+
+That is, whenever λ∗_i is greater than zero, its corresponding inequality constraint must be
+tight; conversely, any strictly satisfied inequality must have λ∗_i equal to zero. Thus, we can
+interpret the dual variables λ∗_i as measuring the "importance" of a particular constraint
+in characterizing the optimal point.
+This interpretation provides an intuitive explanation for the difference between hard-
+margin and soft-margin SVMs. Recall the dual problem for the hard-margin SVM:
+
+        maximize_α  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
+        subject to  α_i ≥ 0,               i = 1, . . . , m,            (2)
+                    Σ_{i=1}^m α_i y^(i) = 0,                            (3)
+
+and for the L1 soft-margin SVM:
+
+        maximize_α  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
+        subject to  0 ≤ α_i ≤ C,           i = 1, . . . , m,
+                    Σ_{i=1}^m α_i y^(i) = 0.
+
+Note that the only difference in the soft-margin formulation is the introduction of upper
+bounds on the dual variables α_i. Effectively, this upper bound constraint limits the influence
+of any single primal inequality constraint (i.e., any single training example) on the decision
+boundary, leading to improved robustness for the L1 soft-margin model.
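+The three margin cases implied by KKT complementarity for the soft-margin SVM (see the
+summary in the Appendix) can be checked mechanically. Here is a minimal sketch using
+hypothetical solved values, purely for illustration; in a real run, alpha and the margins
+would come from an SVM solver:
+
+import numpy as np
+
+C = 1.0
+alpha  = np.array([0.0, 0.3, 1.0, 1.0])   # hypothetical dual optimal alpha*
+margin = np.array([1.7, 1.0, 0.4, -0.2])  # hypothetical y^(i)(w*^T x^(i) + b*)
+
+for a, m in zip(alpha, margin):
+    if a == 0:                             # strictly outside the margin
+        assert m >= 1 - 1e-9
+    elif a < C:                            # exactly on the margin
+        assert abs(m - 1) <= 1e-9
+    else:                                  # inside the margin / misclassified
+        assert m <= 1 + 1e-9
+print("KKT margin cases consistent")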
+What consequences do the KKT conditions have for practical optimization algorithms?
+When Slater's condition holds, the KKT conditions are both necessary and sufficient for
+primal/dual optimality of a candidate primal solution x̃ and a corresponding dual solution
+(λ̃, ν̃). Therefore, many optimization algorithms work by trying to guarantee that the KKT
+conditions are satisfied; the SMO algorithm, for instance, works by iteratively identifying
+Lagrange multipliers for which the corresponding KKT conditions are unsatisfied and then
+"fixing" KKT complementarity.^4
+
+^4 See [1], pages 244-245 for an example of an optimization problem where the KKT conditions can be
+solved directly, thus skipping the need for primal/dual optimization altogether.
+
+3   Algorithms for convex optimization
+
+Thus far, we have talked about convex optimization problems and their properties. But
+how does one solve a convex optimization problem in practice? In this section, we describe
+a generic strategy for solving convex optimization problems known as the interior-point
+method. This method combines a safe-guarded variant of Newton's algorithm with a "bar-
+rier" technique for enforcing inequality constraints.
+
+3.1   Unconstrained optimization
+
+We consider first the problem of unconstrained optimization, i.e.,
+
+        minimize_x  f(x).
+
+In Newton's algorithm for unconstrained optimization, we consider the Taylor approxi-
+mation f̃ of the function f, centered at the current iterate x_t. Discarding terms of higher
+order than two, we have
+
+        f̃(x) = f(x_t) + ∇_x f(x_t)^T (x − x_t) + (1/2)(x − x_t)^T ∇²_x f(x_t)(x − x_t).
+
+To minimize f̃(x), we can set its gradient to zero. In particular, if x_nt denotes the minimum
+of f̃(x), then
+
+        ∇_x f(x_t) + ∇²_x f(x_t)(x_nt − x_t) = 0
+        ∇²_x f(x_t)(x_nt − x_t) = −∇_x f(x_t)
+        x_nt − x_t = −∇²_x f(x_t)^{−1} ∇_x f(x_t)
+        x_nt = x_t − ∇²_x f(x_t)^{−1} ∇_x f(x_t)
+
+assuming ∇²_x f(x_t) is positive definite (and hence, full rank). This, of course, is the standard
+Newton algorithm for unconstrained minimization.
+While Newton's method converges quickly if given an initial point near the minimum, for
+points far from the minimum, Newton's method can sometimes diverge (as you may have
+discovered in problem 1 of Problem Set #1 if you picked an unfortunate initial point!). A
+simple fix for this behavior is to use a line-search procedure. Define the search direction d
+to be,
+
+        d := ∇²_x f(x_t)^{−1} ∇_x f(x_t).
+
+A line-search procedure is an algorithm for finding an appropriate step size γ ≥ 0 such that
+the iteration
+
+        x_{t+1} = x_t − γ · d
+
+will ensure that the function f decreases by a sufficient amount (relative to the size of the
+step taken) during each iteration.
+One simple yet effective method for doing this is called a backtracking line search.
+In this method, one initially sets γ to 1 and then iteratively reduces γ by a multiplicative
+factor β until f(x_t − γ · d) is sufficiently smaller than f(x_t):
+
+        Backtracking line-search
+        • Choose α ∈ (0, 0.5), β ∈ (0, 1).
+        • Set γ ← 1.
+        • While f(x_t − γ · d) > f(x_t) − γ · α ∇_x f(x_t)^T d, do γ ← βγ.
+        • Return γ.
+
+Since the function f is known to decrease locally near x_t in the direction of −d, such a
+step will be found, provided γ is small enough. For more details, see [1], pages 464-466.
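+Here is a minimal Python sketch of damped Newton with backtracking line search, applied
+to a smooth strictly convex test function of our own choosing (log-sum-exp plus a quadratic);
+it is an illustration of the procedure above, not a production implementation:
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+Amat = rng.normal(size=(10, 3))
+
+def f(x):    return np.log(np.exp(Amat @ x).sum()) + x @ x
+def grad(x):
+    p = np.exp(Amat @ x); p /= p.sum()
+    return Amat.T @ p + 2 * x
+def hess(x):
+    p = np.exp(Amat @ x); p /= p.sum()
+    return Amat.T @ (np.diag(p) - np.outer(p, p)) @ Amat + 2 * np.eye(3)
+
+x, alpha, beta = np.zeros(3), 0.25, 0.5
+for _ in range(50):
+    g = grad(x)
+    d = np.linalg.solve(hess(x), g)        # Newton step is -d
+    if g @ d < 1e-12:                      # squared Newton decrement: stop
+        break
+    gamma = 1.0                            # backtracking line search
+    while f(x - gamma * d) > f(x) - gamma * alpha * g @ d:
+        gamma *= beta
+    x = x - gamma * d
+print(x, f(x))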
+In order to use Newton's method, one must be able to compute and invert the Hessian
+matrix ∇²_x f(x_t), or equivalently, compute the search direction d indirectly without forming
+the Hessian. For some problems, the number of primal variables x is sufficiently large
+that computing the Hessian can be very difficult. In many cases, this can be dealt with
+by clever use of linear algebra. In other cases, however, we can resort to other nonlinear
+minimization schemes, such as quasi-Newton methods, which initially behave like gradient
+descent but gradually construct approximations of the inverse Hessian based on the gradients
+observed throughout the course of the optimization.^5 Alternatively, nonlinear conjugate
+gradient schemes (which augment the standard conjugate gradient (CG) algorithm for
+solving linear least squares systems with a line-search) provide another generic blackbox tool
+for multivariable function minimization which is simple to implement, yet highly effective in
+practice.^6
+
+^5 For more information on quasi-Newton methods, the standard reference is Jorge Nocedal and Stephen
+J. Wright's textbook, Numerical Optimization.
+^6 For an excellent tutorial on the conjugate gradient method, see Jonathan Shewchuk's tutorial, available
+at: http://www.cs.cmu.edu/∼quake-papers/painless-conjugate-gradient.pdf
+
+3.2   Inequality-constrained optimization
+
+Using our tools for unconstrained optimization described in the previous section, we now
+tackle the (slightly) harder problem of constrained optimization. For simplicity, we consider
+convex optimization problems without equality constraints^7, i.e., problems of the form,
+
+        minimize_x  f(x)
+        subject to  g_i(x) ≤ 0,  i = 1, . . . , m.
+
+We will also assume knowledge of a feasible starting point x₀ which satisfies all of our
+constraints with strict inequality (as needed for Slater's condition to hold).^8
+Recall that in our discussion of the Lagrangian-based formulation of the primal problem,
+
+        min_x max_{λ : λ_i ≥ 0} L(x, λ),
+
+we stated that the inner maximization, max_{λ : λ_i ≥ 0} L(x, λ), was constructed in such a way
+that the infeasible region of f was "carved away", leaving only points in the feasible region
+as candidate minima. The same idea of using penalties to ensure that minimizers stay in the
+feasible region is the basis of barrier-based optimization. Specifically, if B(z) is the barrier
+function
+
+        B(z) = { 0   z < 0
+               { ∞   z ≥ 0,
+
+then the primal problem is equivalent to
+
+        min_x  f(x) + Σ_{i=1}^m B(g_i(x)).                              (4)
+
+When g_i(x) < 0, the objective of the problem is simply f(x); infeasible points are "carved
+away" using the barrier function B(z).
+While conceptually correct, optimization using the straight barrier function B(z) is nu-
+merically difficult. To ameliorate this, the log-barrier optimization algorithm approximates
+the solution to (4) by solving the unconstrained problem,
+
+        minimize_x  f(x) − (1/t) Σ_{i=1}^m log(−g_i(x)),
+
+for some fixed t > 0. Here, the function −(1/t) log(−z) ≈ B(z), and the accuracy of the
+approximation increases as t → ∞. Rather than using a large value of t in order to obtain
+a good approximation, however, the log-barrier algorithm works by solving a sequence of
+unconstrained optimization problems, increasing t each time, and using the solution of the
+previous unconstrained optimization problem as the initial point for the next unconstrained
+optimization. Furthermore, at each point in the algorithm, the primal solution points stay
+strictly in the interior of the feasible region:
+
+        Log-barrier optimization
+        • Choose µ > 1, t > 0.
+        • x ← x₀.
+        • Repeat until convergence:
+          (a) Compute x' = argmin_x f(x) − (1/t) Σ_{i=1}^m log(−g_i(x)), using x as the
+              initial point.
+          (b) t ← µ · t, x ← x'.
+
+^7 In practice, there are many ways of dealing with equality constraints. Sometimes, we can eliminate
+equality constraints by either reparameterizing the original primal problem, or converting to the dual
+problem. A more general strategy is to rely on equality-constrained variants of Newton's algorithms which
+ensure that the equality constraints are satisfied at every iteration of the optimization. For a more complete
+treatment of this topic, see [1], Chapter 10.
+^8 For more information on finding feasible starting points for barrier algorithms, see [1], pages 579-585.
+For inequality-constrained problems where the primal problem is feasible but not strictly feasible, primal-
+dual interior point methods are applicable, also described in [1], pages 609-615.
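+Here is a small Python sketch of this loop on the toy problem from Section 2 (minimize x²
+subject to x ≥ 1, optimum x∗ = 1). For this particular problem the centering step (a) has a
+closed-form solution, obtained by setting the derivative 2x − 1/(t(x − 1)) of the log-barrier
+objective to zero; in general, step (a) would be an inner Newton solve as in Section 3.1:
+
+import numpy as np
+
+# Log-barrier on: minimize x^2 subject to g(x) = 1 - x <= 0.
+# Centering problem: minimize_x  x^2 - (1/t) * log(x - 1),
+# whose minimizer is x(t) = (1 + sqrt(1 + 2/t)) / 2.
+t, mu = 1.0, 10.0
+x = 2.0                                      # strictly feasible start
+for _ in range(8):
+    x = (1.0 + np.sqrt(1.0 + 2.0 / t)) / 2.0 # step (a): centering
+    print(t, x)                              # x(t) -> 1 from the interior
+    t *= mu                                  # step (b): increase t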
+One might expect that as t increases, the difficulty of solving each unconstrained mini-
+mization problem also increases due to numerical issues or ill-conditioning of the optimization
+problem. Surprisingly, Nesterov and Nemirovski showed in 1994 that this is not the case
+for certain types of barrier functions, including the log-barrier; in particular, by using an
+appropriate barrier function, one obtains a general convex optimization algorithm which
+takes time polynomial in the dimensionality of the optimization variables and the desired
+accuracy!
+
+4   Directions for further exploration
+
+In many real-world tasks, 90% of the challenge involves figuring out how to write an opti-
+mization problem in a convex form. Once the correct form has been found, a number of
+pre-existing software packages for convex optimization have been well-tuned to handle dif-
+ferent specific types of optimization problems. The following constitute a small sample of
+the available tools:
+
+• commercial packages: CPLEX, MOSEK
+• MATLAB-based: CVX, Optimization Toolbox (linprog, quadprog), SeDuMi
+• libraries: CVXOPT (Python), GLPK (C), COIN-OR (C)
+• SVMs: LIBSVM, SVM-light
+• machine learning: Weka (Java)
+
+In particular, we specifically point out CVX as an easy-to-use generic tool for solving convex
+optimization problems easily using MATLAB, and CVXOPT as a powerful Python-based
+library which runs independently of MATLAB.^9 If you're interested in looking at some of the
+other packages listed above, they are easy to find with a web search. In short, if you need a
+specific convex optimization algorithm, pre-existing software packages provide a rapid way
+to prototype your idea without having to deal with the numerical trickiness of implementing
+your own complete convex optimization routines.
+Also, if you find this material fascinating, make sure to check out Stephen Boyd's class,
+EE364: Convex Optimization I, which will be offered during the Winter Quarter. The
+textbook for the class (listed as [1] in the References) has a wealth of information about
+convex optimization and is available for browsing online.
+
+^9 CVX is available at http://www.stanford.edu/∼boyd/cvx and CVXOPT is available at
+http://www.ee.ucla.edu/∼vandenbe/cvxopt/.
+References
+
+[1] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge UP, 2004.
+    Online: http://www.stanford.edu/∼boyd/cvxbook/
+
+Appendix: The soft-margin SVM
+
+To see the primal/dual action in practice, we derive the dual of the soft-margin SVM primal
+presented in class, and the corresponding KKT complementarity conditions. We have,
+
+        minimize_{w,b,ξ}  (1/2)||w||² + C Σ_{i=1}^m ξ_i
+        subject to  y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i,  i = 1, . . . , m,
+                    ξ_i ≥ 0,                         i = 1, . . . , m.
+
+First, we put this into our standard form, with "≤ 0" inequality constraints and no equality
+constraints. That is,
+
+        minimize_{w,b,ξ}  (1/2)||w||² + C Σ_{i=1}^m ξ_i
+        subject to  1 − ξ_i − y^(i)(w^T x^(i) + b) ≤ 0,  i = 1, . . . , m,
+                    −ξ_i ≤ 0,                            i = 1, . . . , m.
+
+Next, we form the generalized Lagrangian,^10
+
+        L(w, b, ξ, α, β) = (1/2)||w||² + C Σ_{i=1}^m ξ_i
+                           + Σ_{i=1}^m α_i (1 − ξ_i − y^(i)(w^T x^(i) + b)) − Σ_{i=1}^m β_i ξ_i,
+
+which gives the primal and dual optimization problems:
+
+        max_{α,β : α_i ≥ 0, β_i ≥ 0} θ_D(α, β)   where θ_D(α, β) := min_{w,b,ξ} L(w, b, ξ, α, β),     (SVM-D)
+        min_{w,b,ξ} θ_P(w, b, ξ)   where θ_P(w, b, ξ) := max_{α,β : α_i ≥ 0, β_i ≥ 0} L(w, b, ξ, α, β). (SVM-P)
+
+To get the dual problem in the form shown in the lecture notes, however, we still have a
+little more work to do. In particular,
+
+^10 Here, it is important to note that (w, b, ξ) collectively play the role of the x primal variables. Similarly,
+(α, β) collectively play the role of the λ dual variables used for inequality constraints. There are no "ν" dual
+variables here since there are no affine constraints in this problem.
+
+1. Eliminating the primal variables. To eliminate the primal variables from the dual
+problem, we compute θ_D(α, β) by noticing that
+
+        θ_D(α, β) = min_{w,b,ξ} L(w, b, ξ, α, β)
+
+is an unconstrained optimization problem, where the objective function L(w, b, ξ, α, β)
+is differentiable. Therefore, for any fixed (α, β), if (ŵ, b̂, ξ̂) minimize the Lagrangian,
+it must be the case that
+
+        ∇_w L(ŵ, b̂, ξ̂, α, β) = ŵ − Σ_{i=1}^m α_i y^(i) x^(i) = 0       (5)
+        ∂/∂b L(ŵ, b̂, ξ̂, α, β) = − Σ_{i=1}^m α_i y^(i) = 0              (6)
+        ∂/∂ξ_i L(ŵ, b̂, ξ̂, α, β) = C − α_i − β_i = 0.                   (7)
+
+Adding (6) and (7) to the constraints of our dual optimization problem, we obtain,
+
+        θ_D(α, β) = L(ŵ, b̂, ξ̂, α, β)
+            = (1/2)||ŵ||² + C Σ_{i=1}^m ξ̂_i + Σ_{i=1}^m α_i (1 − ξ̂_i − y^(i)(ŵ^T x^(i) + b̂)) − Σ_{i=1}^m β_i ξ̂_i
+            = (1/2)||ŵ||² + C Σ_{i=1}^m ξ̂_i + Σ_{i=1}^m α_i (1 − ξ̂_i − y^(i)(ŵ^T x^(i))) − Σ_{i=1}^m β_i ξ̂_i
+            = (1/2)||ŵ||² + Σ_{i=1}^m α_i (1 − y^(i)(ŵ^T x^(i))),
+
+where the first equality follows from the optimality of (ŵ, b̂, ξ̂) for fixed (α, β), the
+second equality uses the definition of the generalized Lagrangian, and the third and
+fourth equalities follow from (6) and (7), respectively. Finally, to use (5), observe that
+
+        (1/2)||ŵ||² + Σ_{i=1}^m α_i (1 − y^(i)(ŵ^T x^(i)))
+            = Σ_{i=1}^m α_i + (1/2)||ŵ||² − ŵ^T Σ_{i=1}^m α_i y^(i) x^(i)
+            = Σ_{i=1}^m α_i + (1/2)||ŵ||² − ||ŵ||²
+            = Σ_{i=1}^m α_i − (1/2)||ŵ||²
+            = Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩.
+
+Therefore, our dual problem (with no more primal variables) is simply
+
+        maximize_{α,β}  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
+        subject to  α_i ≥ 0,           i = 1, . . . , m,
+                    β_i ≥ 0,           i = 1, . . . , m,
+                    α_i + β_i = C,     i = 1, . . . , m,
+                    Σ_{i=1}^m α_i y^(i) = 0.
+2. KKT complementarity. KKT complementarity requires that for any primal optimal
+(w∗, b∗, ξ∗) and dual optimal (α∗, β∗),
+
+        α∗_i (1 − ξ∗_i − y^(i)(w∗^T x^(i) + b∗)) = 0
+        β∗_i ξ∗_i = 0
+
+for i = 1, . . . , m. From the first condition, we see that if α∗_i > 0, then in order for the
+product to be zero, we need 1 − ξ∗_i − y^(i)(w∗^T x^(i) + b∗) = 0. It follows that
+
+        y^(i)(w∗^T x^(i) + b∗) ≤ 1
+
+since ξ∗ ≥ 0 by primal feasibility. Similarly, if β∗_i > 0, then ξ∗_i = 0 to ensure comple-
+mentarity. From the primal constraint, y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i, it follows that
+
+        y^(i)(w∗^T x^(i) + b∗) ≥ 1.
+
+Finally, since β∗_i > 0 is equivalent to α∗_i < C (since α∗_i + β∗_i = C), we can summarize
+the KKT conditions as follows:
+
+        α∗_i = 0      ⇒  y^(i)(w∗^T x^(i) + b∗) ≥ 1,
+        0 < α∗_i < C  ⇒  y^(i)(w∗^T x^(i) + b∗) = 1,
+        α∗_i = C      ⇒  y^(i)(w∗^T x^(i) + b∗) ≤ 1.
+
+3. Simplification. We can tidy up our dual problem slightly by observing that each pair
+of constraints of the form
+
+        β_i ≥ 0,    α_i + β_i = C
+
+is equivalent to the single constraint α_i ≤ C; that is, if we solve the optimization
+problem
+
+        maximize_α  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^(i) y^(j) ⟨x^(i), x^(j)⟩
+        subject to  0 ≤ α_i ≤ C,    i = 1, . . . , m,                   (8)
+                    Σ_{i=1}^m α_i y^(i) = 0,
+
+and subsequently set β_i = C − α_i, then it follows that (α, β) will be optimal for the
+previous dual problem above. This last form, indeed, is the form of the soft-margin
+SVM dual given in the lecture notes.
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-gp.txt b/Lectures/aimlcs229/cs229-gp.txt
new file mode 100644
index 0000000..88a3202
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-gp.txt
@@ -0,0 +1,1243 @@
+Gaussian processes
+Chuong B. Do
+December 1, 2007
+
+Many of the classical machine learning algorithms that we talked about during the first
+half of this course fit the following pattern: given a training set of i.i.d. examples sampled
+from some unknown distribution,
+
+1. solve a convex optimization problem in order to identify the single "best fit" model for
+   the data, and
+2. use this estimated model to make "best guess" predictions for future test input points.
+
+In these notes, we will talk about a different flavor of learning algorithms, known as
+Bayesian methods. Unlike classical learning algorithms, Bayesian algorithms do not at-
+tempt to identify "best-fit" models of the data (or similarly, make "best guess" predictions
+for new test inputs). Instead, they compute a posterior distribution over models (or similarly,
+compute posterior predictive distributions for new test inputs). These distributions provide
+a useful way to quantify our uncertainty in model estimates, and to exploit our knowledge
+of this uncertainty in order to make more robust predictions on new test points.
+We focus on regression problems, where the goal is to learn a mapping from some input
+space X = R^n of n-dimensional vectors to an output space Y = R of real-valued targets.
+In particular, we will talk about a kernel-based fully Bayesian regression algorithm, known
+as Gaussian process regression. The material covered in these notes draws heavily on many
+different topics that we discussed previously in class (namely, the probabilistic interpretation
+of linear regression^1, Bayesian methods^2, kernels^3, and properties of multivariate Gaussians^4).
+The organization of these notes is as follows. In Section 1, we provide a brief review
+of multivariate Gaussian distributions and their properties.
+In Section 2, we briefly review
+Bayesian methods in the context of probabilistic linear regression. The central ideas under-
+lying Gaussian processes are presented in Section 3, and we derive the full Gaussian process
+regression model in Section 4.
+
+^1 See course lecture notes on "Supervised Learning, Discriminative Algorithms."
+^2 See course lecture notes on "Regularization and Model Selection."
+^3 See course lecture notes on "Support Vector Machines."
+^4 See course lecture notes on "Factor Analysis."
+
+1   Multivariate Gaussians
+
+A vector-valued random variable x ∈ R^n is said to have a multivariate normal (or
+Gaussian) distribution with mean µ ∈ R^n and covariance matrix Σ ∈ S^n_{++} if
+
+        p(x; µ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) · exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) ).     (1)
+
+We write this as x ∼ N(µ, Σ). Here, recall from the section notes on linear algebra that S^n_{++}
+refers to the space of symmetric positive definite n × n matrices.^5
+Generally speaking, Gaussian random variables are extremely useful in machine learning
+and statistics for two main reasons. First, they are extremely common when modeling "noise"
+in statistical algorithms. Quite often, noise can be considered to be the accumulation of a
+large number of small independent random perturbations affecting the measurement process;
+by the Central Limit Theorem, summations of independent random variables will tend to
+"look Gaussian." Second, Gaussian random variables are convenient for many analytical
+manipulations, because many of the integrals involving Gaussian distributions that arise in
+practice have simple closed form solutions. In the remainder of this section, we will review
+a number of useful properties of multivariate Gaussians.
+Consider a random vector x ∈ R^n with x ∼ N(µ, Σ). Suppose also that the variables in x
+have been partitioned into two sets x_A = [x₁ · · · x_r]^T ∈ R^r and x_B = [x_{r+1} · · · x_n]^T ∈ R^{n−r}
+(and similarly for µ and Σ), such that
+
+        x = [ x_A ],   µ = [ µ_A ],   Σ = [ Σ_AA  Σ_AB ].
+            [ x_B ]        [ µ_B ]        [ Σ_BA  Σ_BB ]
+
+Here, Σ_AB = Σ_BA^T since Σ = E[(x − µ)(x − µ)^T] = Σ^T. The following properties hold:
+
+1. Normalization. The density function normalizes, i.e.,
+
+        ∫_x p(x; µ, Σ) dx = 1.
+
+This property, though seemingly trivial at first glance, turns out to be immensely
+useful for evaluating all sorts of integrals, even ones which appear to have no relation
+to probability distributions at all (see Appendix A.1)!
+
+2. Marginalization. The marginal densities,
+
+        p(x_A) = ∫_{x_B} p(x_A, x_B; µ, Σ) dx_B
+        p(x_B) = ∫_{x_A} p(x_A, x_B; µ, Σ) dx_A
+
+are Gaussian:
+
+        x_A ∼ N(µ_A, Σ_AA)
+        x_B ∼ N(µ_B, Σ_BB).
+
+3. Conditioning. The conditional densities
+
+        p(x_A | x_B) = p(x_A, x_B; µ, Σ) / ∫_{x_A} p(x_A, x_B; µ, Σ) dx_A
+        p(x_B | x_A) = p(x_A, x_B; µ, Σ) / ∫_{x_B} p(x_A, x_B; µ, Σ) dx_B
+
+are also Gaussian:
+
+        x_A | x_B ∼ N( µ_A + Σ_AB Σ_BB^{−1} (x_B − µ_B), Σ_AA − Σ_AB Σ_BB^{−1} Σ_BA )
+        x_B | x_A ∼ N( µ_B + Σ_BA Σ_AA^{−1} (x_A − µ_A), Σ_BB − Σ_BA Σ_AA^{−1} Σ_AB ).
+
+A proof of this property is given in Appendix A.2. (A short numerical check of these
+formulas follows this list.)
+
+4. Summation. The sum of independent Gaussian random variables (with the same
+dimensionality), y ∼ N(µ, Σ) and z ∼ N(µ', Σ'), is also Gaussian:
+
+        y + z ∼ N(µ + µ', Σ + Σ').
+
+^5 There are actually cases in which we would want to deal with multivariate Gaussian distributions where
+Σ is positive semidefinite but not positive definite (i.e., Σ is not full rank). In such cases, Σ^{−1} does not
+exist, so the definition of the Gaussian density given in (1) does not apply. For instance, see the course
+lecture notes on "Factor Analysis."
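+As promised above, here is a small numerical check (ours, with an arbitrary 2-D example)
+of the conditioning property, comparing the closed-form conditional mean and variance of
+x_A | x_B against empirical estimates from samples whose x_B lies near a fixed value:
+
+import numpy as np
+
+mu = np.array([1.0, -1.0])
+Sigma = np.array([[2.0, 0.8],
+                  [0.8, 1.0]])
+xB = 0.5
+
+# closed form: x_A | x_B ~ N(mu_A + S_AB S_BB^{-1} (x_B - mu_B),
+#                            S_AA - S_AB S_BB^{-1} S_BA)
+cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (xB - mu[1])
+cond_var  = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
+
+rng = np.random.default_rng(0)
+xs = rng.multivariate_normal(mu, Sigma, size=2_000_000)
+near = xs[np.abs(xs[:, 1] - xB) < 0.01]     # samples with x_B near 0.5
+print(cond_mean, near[:, 0].mean())         # should nearly agree
+print(cond_var,  near[:, 0].var())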
+2   Bayesian linear regression
+
+Let S = {(x^(i), y^(i))}_{i=1}^m be a training set of i.i.d. examples from some unknown distribu-
+tion. The standard probabilistic interpretation of linear regression states that
+
+        y^(i) = θ^T x^(i) + ε^(i),    i = 1, . . . , m
+
+where the ε^(i) are i.i.d. "noise" variables with independent N(0, σ²) distributions. It follows
+that y^(i) − θ^T x^(i) ∼ N(0, σ²), or equivalently,
+
+        P(y^(i) | x^(i), θ) = 1/(√(2π) σ) · exp( −(y^(i) − θ^T x^(i))² / (2σ²) ).
+
+For notational convenience, we define
+
+        X = [ — (x^(1))^T — ]
+            [       ⋮       ]  ∈ R^{m×n},   y = (y^(1), . . . , y^(m))^T ∈ R^m,
+            [ — (x^(m))^T — ]
+
+        ε = (ε^(1), . . . , ε^(m))^T ∈ R^m.
+
+Figure 1: Bayesian linear regression for a one-dimensional linear regression problem, y^(i) =
+θx^(i) + ε^(i), with ε^(i) ∼ N(0, 1) i.i.d. noise. The green region denotes the 95% confidence
+region for predictions of the model. Note that the (vertical) width of the green region is
+largest at the ends but narrowest in the middle. This region reflects the uncertainty in the
+estimates for the parameter θ. In contrast, a classical linear regression model would display
+a confidence region of constant width, reflecting only the N(0, σ²) noise in the outputs.
+
+In Bayesian linear regression, we assume that a prior distribution over parameters is
+also given; a typical choice, for instance, is θ ∼ N(0, τ²I). Using Bayes's rule, we obtain the
+parameter posterior,
+
+        p(θ | S) = p(θ) p(S | θ) / ∫_{θ'} p(θ') p(S | θ') dθ'
+                 = p(θ) Π_{i=1}^m p(y^(i) | x^(i), θ) / ∫_{θ'} p(θ') Π_{i=1}^m p(y^(i) | x^(i), θ') dθ'.   (2)
+
+Assuming the same noise model on testing points as on our training points, the "output" of
+Bayesian linear regression on a new test point x∗ is not just a single guess "y∗", but rather
+an entire probability distribution over possible outputs, known as the posterior predictive
+distribution:
+
+        p(y∗ | x∗, S) = ∫_θ p(y∗ | x∗, θ) p(θ | S) dθ.                  (3)
+
+For many types of models, the integrals in (2) and (3) are difficult to compute, and hence,
+we often resort to approximations, such as MAP estimation (see course lecture notes on
+"Regularization and Model Selection").
+In the case of Bayesian linear regression, however, the integrals actually are tractable! In
+particular, for Bayesian linear regression, one can show (after much work!) that
+
+        θ | S ∼ N( (1/σ²) A^{−1} X^T y,  A^{−1} )
+        y∗ | x∗, S ∼ N( (1/σ²) x∗^T A^{−1} X^T y,  x∗^T A^{−1} x∗ + σ² )
+
+where A = (1/σ²) X^T X + (1/τ²) I. The derivation of these formulas is somewhat involved.^6
+Nonetheless, from these equations, we get at least a flavor of what Bayesian methods are all
+about: the posterior distribution over the test output y∗ for a test input x∗ is a Gaussian
+distribution; this distribution reflects the uncertainty in our predictions y∗ = θ^T x∗ + ε∗
+arising from both the randomness in ε∗ and the uncertainty in our choice of parameters θ.
+In contrast, classical probabilistic linear regression models estimate parameters θ directly
+from the training data but provide no estimate of how reliable these learned parameters
+may be (see Figure 1).
+
+^6 For the complete derivation, see, for instance, [1]. Alternatively, read the Appendices, which give a
+number of arguments based on the "completion-of-squares" trick, and derive this formula yourself!
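+Here is a minimal sketch of these closed-form equations on synthetic one-dimensional data
+of our own making (with true parameter θ = 0.7, an arbitrary choice for the illustration):
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+sigma, tau, m = 1.0, 10.0, 50
+X = rng.uniform(-5, 5, size=(m, 1))
+y = 0.7 * X[:, 0] + rng.normal(0.0, sigma, size=m)
+
+# A = X^T X / sigma^2 + I / tau^2, as in the formulas above
+A = X.T @ X / sigma**2 + np.eye(1) / tau**2
+A_inv = np.linalg.inv(A)
+theta_mean = (A_inv @ X.T @ y) / sigma**2     # posterior mean of theta
+
+x_star = np.array([3.0])                       # a test input
+pred_mean = x_star @ theta_mean
+pred_var = x_star @ A_inv @ x_star + sigma**2  # widens away from the data
+print(theta_mean, pred_mean, pred_var)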
+3   Gaussian processes
+
+As described in Section 1, multivariate Gaussian distributions are useful for modeling finite
+collections of real-valued variables because of their nice analytical properties. Gaussian
+processes are the extension of multivariate Gaussians to infinite-sized collections of real-
+valued variables. In particular, this extension will allow us to think of Gaussian processes as
+distributions not just over random vectors but in fact distributions over random functions.^7
+
+^7 Let H be a class of functions mapping from X → Y. A random function h(·) from H is a function which
+is randomly drawn from H, according to some probability distribution over H. One potential source of
+confusion is that you may be tempted to think of random functions as functions whose outputs are in some
+way stochastic; this is not the case. Instead, a random function h(·), once selected from H probabilistically,
+implies a deterministic mapping from inputs in X to outputs in Y.
+
+3.1   Probability distributions over functions with finite domains
+
+To understand how one might parameterize probability distributions over functions, consider
+the following simple example. Let X = {x₁, . . . , x_m} be any finite set of elements. Now,
+consider the set H of all possible functions mapping from X to R. For instance, one example
+of a function h₀(·) ∈ H is given by
+
+        h₀(x₁) = 5,  h₀(x₂) = −7,  h₀(x₃) = 2.3,  . . . ,  h₀(x_{m−1}) = −π,  h₀(x_m) = 8.
+
+Since the domain of any h(·) ∈ H has only m elements, we can always represent h(·) com-
+pactly as an m-dimensional vector, h = [h(x₁) h(x₂) · · · h(x_m)]^T. In order to specify
+a probability distribution over functions h(·) ∈ H, we must associate some "probability
+density" with each function in H. One natural way to do this is to exploit the one-to-one
+correspondence between functions h(·) ∈ H and their vector representations, h. In particu-
+lar, if we specify that h ∼ N(µ, σ²I), then this in turn implies a probability distribution
+over functions h(·), whose probability density function is given by
+
+        p(h) = Π_{i=1}^m 1/(√(2π) σ) · exp( −(h(x_i) − µ_i)² / (2σ²) ).
+
+In the example above, we showed that probability distributions over functions with finite
+domains can be represented using a finite-dimensional multivariate Gaussian distribution
+over function outputs h(x₁), . . . , h(x_m) at a finite number of input points x₁, . . . , x_m. How
+can we specify probability distributions over functions when the domain size may be infinite?
+For this, we turn to a fancier type of probability distribution known as a Gaussian process.
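+The finite-domain case is easy to play with numerically. Here is a tiny sketch (our own
+illustration, with an arbitrary domain of five named points) in which a draw h ∼ N(0, σ²I)
+is exactly one random function, represented as a vector of its values:
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+m, sigma = 5, 2.0
+domain = [f"x{i+1}" for i in range(m)]
+h = rng.normal(0.0, sigma, size=m)          # one random function, as a vector
+print(dict(zip(domain, np.round(h, 2))))    # its value at each domain point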
[h(x1); . . . ; h(xm)] ∼ N( [m(x1); . . . ; m(xm)],  K ),

where K is the m × m matrix with entries Kij = k(xi, xj). We denote this using the notation,

h(·) ∼ GP(m(·), k(·, ·)).

Observe that the mean function and covariance function are aptly named since the above properties imply that

m(x) = E[h(x)]
k(x, x′) = E[(h(x) − m(x))(h(x′) − m(x′))]

for any x, x′ ∈ X.

Intuitively, one can think of a function h(·) drawn from a Gaussian process prior as an extremely high-dimensional vector drawn from an extremely high-dimensional multivariate Gaussian. Here, each dimension of the Gaussian corresponds to an element x from the index set X, and the corresponding component of the random vector represents the value of h(x). Using the marginalization property for multivariate Gaussians, we can obtain the marginal multivariate Gaussian density corresponding to any finite subcollection of variables.

What sort of functions m(·) and k(·, ·) give rise to valid Gaussian processes? In general, any real-valued function m(·) is acceptable, but for k(·, ·), it must be the case that for any set of elements x1, . . . , xm ∈ X, the resulting matrix

K = [ k(x1, x1) · · · k(x1, xm) ;  . . .  ; k(xm, x1) · · · k(xm, xm) ]

is a valid covariance matrix corresponding to some multivariate Gaussian distribution. A standard result in probability theory states that this is true provided that K is positive semidefinite. Sound familiar?

The positive semidefiniteness requirement for covariance matrices computed based on arbitrary input points is, in fact, identical to Mercer's condition for kernels! A function k(·, ·) is a valid kernel provided the resulting kernel matrix K defined as above is always positive semidefinite for any set of input points x1, . . . , xm ∈ X. Gaussian processes, therefore, are kernel-based probability distributions in the sense that any valid kernel function can be used as a covariance function!

8 Often, when X = R, one can interpret the indices x ∈ X as representing times, and hence the variables h(x) represent the temporal evolution of some random quantity over time. In the models that are used for Gaussian process regression, however, the index set is taken to be the input space of our regression problem.

[Figure 2: Samples from a zero-mean Gaussian process prior with kSE(·, ·) covariance function, k(x, z) = exp(−||x − z||² / (2τ²)), using (a) τ = 0.5, (b) τ = 2, and (c) τ = 10. Note that as the bandwidth parameter τ increases, points which are farther away will have higher correlations than before, and hence the sampled functions tend to be smoother overall.]
3.3 The squared exponential kernel

In order to get an intuition for how Gaussian processes work, consider a simple zero-mean Gaussian process,

h(·) ∼ GP(0, k(·, ·)),

defined for functions h : X → R where we take X = R. Here, we choose the kernel function k(·, ·) to be the squared exponential9 kernel function, defined as

kSE(x, x′) = exp( −(1/(2τ²)) ||x − x′||² )

for some τ > 0. What do random functions sampled from this Gaussian process look like? In our example, since we use a zero-mean Gaussian process, we would expect that the function values from our Gaussian process will tend to be distributed around zero. Furthermore, for any pair of elements x, x′ ∈ X,

• h(x) and h(x′) will tend to have high covariance when x and x′ are "nearby" in the input space (i.e., ||x − x′|| = |x − x′| ≈ 0, so exp(−(1/(2τ²)) ||x − x′||²) ≈ 1).

• h(x) and h(x′) will tend to have low covariance when x and x′ are "far apart" (i.e., ||x − x′|| ≫ 0, so exp(−(1/(2τ²)) ||x − x′||²) ≈ 0).

More simply stated, functions drawn from a zero-mean Gaussian process prior with the squared exponential kernel will tend to be "locally smooth" with high probability; i.e., nearby function values are highly correlated, and the correlation drops off as a function of distance in the input space (see Figure 2).

9 In the context of SVMs, we called this the Gaussian kernel; to avoid confusion with "Gaussian" processes, we refer to this kernel here as the squared exponential kernel, even though the two are formally identical.
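As a concrete illustration, the following Octave sketch draws sample functions from this zero-mean prior at a grid of inputs, in the spirit of Figure 2. The input grid, the value of tau, and the small "jitter" added for numerical stability are all choices made here for illustration only.

% Sample three functions from a zero-mean GP prior with the k_SE kernel.
xs = linspace(0, 10, 100)';                 % input grid
tau = 2;                                    % bandwidth parameter
K = exp(-(xs - xs').^2 / (2 * tau^2));      % K_ij = k_SE(x_i, x_j)
% Tiny jitter keeps chol() happy despite rounding error in K.
L = chol(K + 1e-6 * eye(length(xs)), 'lower');
h = L * randn(length(xs), 3);               % three independent samples
plot(xs, h);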
4 Gaussian process regression

As discussed in the last section, Gaussian processes provide a method for modelling probability distributions over functions. Here, we discuss how probability distributions over functions can be used in the framework of Bayesian regression.

4.1 The Gaussian process regression model

Let S = {(x(i), y(i))}, i = 1, . . . , m, be a training set of i.i.d. examples from some unknown distribution. In the Gaussian process regression model,

y(i) = h(x(i)) + ε(i),  i = 1, . . . , m

where the ε(i) are i.i.d. "noise" variables with independent N(0, σ²) distributions. Like in Bayesian linear regression, we also assume a prior distribution over functions h(·); in particular, we assume a zero-mean Gaussian process prior,

h(·) ∼ GP(0, k(·, ·))

for some valid covariance function k(·, ·).

Now, let T = {(x∗(i), y∗(i))}, i = 1, . . . , m∗, be a set of i.i.d. testing points drawn from the same unknown distribution as S.10 For notational convenience, we define X ∈ R^{m×n} to be the matrix whose ith row is (x(i))ᵀ, along with

y = [y(1); . . . ; y(m)] ∈ R^m,   ε = [ε(1); . . . ; ε(m)],   h = [h(x(1)); . . . ; h(x(m))],

and analogously X∗ ∈ R^{m∗×n} to be the matrix whose ith row is (x∗(i))ᵀ, along with

y∗ = [y∗(1); . . . ; y∗(m∗)] ∈ R^{m∗},   ε∗ = [ε∗(1); . . . ; ε∗(m∗)],   h∗ = [h(x∗(1)); . . . ; h(x∗(m∗))].

Given the training data S, the prior p(h), and the testing inputs X∗, how can we compute the posterior predictive distribution over the testing outputs y∗? For Bayesian linear regression in Section 2, we used Bayes's rule in order to compute the parameter posterior, which we then used to compute the posterior predictive distribution p(y∗ | x∗, S) for a new test point x∗. For Gaussian process regression, however, it turns out that an even simpler solution exists!

10 We assume also that T and S are mutually independent.

4.2 Prediction

Recall that for any function h(·) drawn from our zero-mean Gaussian process prior with covariance function k(·, ·), the marginal distribution over any set of input points belonging to X must have a joint multivariate Gaussian distribution. In particular, this must hold for the training and test points, so we have

[h; h∗] | X, X∗ ∼ N( 0, [ K(X, X), K(X, X∗) ; K(X∗, X), K(X∗, X∗) ] ),

where

h ∈ R^m such that h = [h(x(1)) · · · h(x(m))]ᵀ
h∗ ∈ R^{m∗} such that h∗ = [h(x∗(1)) · · · h(x∗(m∗))]ᵀ
K(X, X) ∈ R^{m×m} such that (K(X, X))ij = k(x(i), x(j))
K(X, X∗) ∈ R^{m×m∗} such that (K(X, X∗))ij = k(x(i), x∗(j))
K(X∗, X) ∈ R^{m∗×m} such that (K(X∗, X))ij = k(x∗(i), x(j))
K(X∗, X∗) ∈ R^{m∗×m∗} such that (K(X∗, X∗))ij = k(x∗(i), x∗(j)).

From our i.i.d. noise assumption, we have that

[ε; ε∗] ∼ N( 0, [ σ²I, 0 ; 0, σ²I ] ).

The sum of independent Gaussian random variables is also Gaussian, so

[y; y∗] | X, X∗ = [h; h∗] + [ε; ε∗] ∼ N( 0, [ K(X, X) + σ²I, K(X, X∗) ; K(X∗, X), K(X∗, X∗) + σ²I ] ).

Now, using the rules for conditioning Gaussians, it follows that

y∗ | y, X, X∗ ∼ N(µ∗, Σ∗)

where

µ∗ = K(X∗, X) (K(X, X) + σ²I)⁻¹ y
Σ∗ = K(X∗, X∗) + σ²I − K(X∗, X) (K(X, X) + σ²I)⁻¹ K(X, X∗).

And that's it! Remarkably, performing prediction in a Gaussian process regression model is very simple, despite the fact that Gaussian processes in themselves are fairly complicated!11

[Figure 3: Gaussian process regression using a zero-mean Gaussian process prior with kSE(·, ·) covariance function (where τ = 0.1), with noise level σ = 1, and (a) m = 10, (b) m = 20, and (c) m = 40 training examples. The blue line denotes the mean of the posterior predictive distribution, and the green shaded region denotes the 95% confidence region based on the model's variance estimates. As the number of training examples increases, the size of the confidence region shrinks to reflect the diminishing uncertainty in the model estimates. Note also that in panel (a), the 95% confidence region shrinks near training points but is much larger far away from training points, as one would expect.]

11 Interestingly, it turns out that Bayesian linear regression, when "kernelized" in the proper way, turns out to be exactly equivalent to Gaussian process regression! But the derivation of the posterior predictive distribution is far more complicated for Bayesian linear regression, and the effort needed to kernelize the algorithm is even greater. The Gaussian process perspective is certainly much easier!
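The prediction equations above translate almost line-for-line into code. Below is a minimal Octave sketch on invented toy data; the kernel bandwidth tau, the noise level sigma, and the underlying function are all illustrative assumptions, not values from the notes.

% GP regression posterior predictive mean and covariance (toy data).
m = 20;
X  = 10 * rand(m, 1);                        % training inputs
y  = sin(X) + 0.1 * randn(m, 1);             % noisy training outputs
Xs = linspace(0, 10, 200)';                  % test inputs
tau = 1; sigma = 0.1;

kse = @(a, b) exp(-(a - b').^2 / (2 * tau^2));   % squared exponential kernel

C       = kse(X, X) + sigma^2 * eye(m);          % K(X,X) + sigma^2 I
Ksx     = kse(Xs, X);                            % K(X*, X)
mu_s    = Ksx * (C \ y);                         % predictive mean
Sigma_s = kse(Xs, Xs) + sigma^2 * eye(rows(Xs)) - Ksx * (C \ Ksx');
band    = 1.96 * sqrt(diag(Sigma_s));            % 95% confidence half-width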
5 Summary

We close our discussion of Gaussian processes by pointing out some reasons why Gaussian processes are an attractive model for use in regression problems and in some cases may be preferable to alternative models (such as linear and locally-weighted linear regression):

1. As Bayesian methods, Gaussian process models allow one to quantify uncertainty in predictions resulting not just from intrinsic noise in the problem but also the errors in the parameter estimation procedure. Furthermore, many methods for model selection and hyperparameter selection in Bayesian methods are immediately applicable to Gaussian processes (though we did not address any of these advanced topics here).

2. Like locally-weighted linear regression, Gaussian process regression is non-parametric and hence can model essentially arbitrary functions of the input points.

3. Gaussian process regression models provide a natural way to introduce kernels into a regression modeling framework. By careful choice of kernels, Gaussian process regression models can sometimes take advantage of structure in the data (though, again, we did not examine this issue here).

4. Gaussian process regression models, though perhaps somewhat tricky to understand conceptually, nonetheless lead to simple and straightforward linear algebra implementations.

References

[1] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. Online: http://www.gaussianprocess.org/gpml/

Appendix A.1

In this example, we show how the normalization property for multivariate Gaussians can be used to compute rather intimidating multidimensional integrals without performing any real calculus! Suppose you wanted to compute the following multidimensional integral,

I(A, b, c) = ∫ exp( −(1/2) xᵀAx − xᵀb − c ) dx,

for some A ∈ S^m_{++}, b ∈ R^m, and c ∈ R. Although one could conceivably perform the multidimensional integration directly (good luck!), a much simpler line of reasoning is based on a mathematical trick known as "completion-of-squares." In particular,

I(A, b, c) = exp(−c) · ∫ exp( −(1/2) xᵀAx − xᵀAA⁻¹b ) dx
           = exp(−c) · ∫ exp( −(1/2) (x + A⁻¹b)ᵀA(x + A⁻¹b) + (1/2) bᵀA⁻¹b ) dx
           = exp( (1/2) bᵀA⁻¹b − c ) · ∫ exp( −(1/2) (x + A⁻¹b)ᵀA(x + A⁻¹b) ) dx.

Defining µ = −A⁻¹b and Σ = A⁻¹, it follows that I(A, b, c) is equal to

[ (2π)^{m/2} |Σ|^{1/2} / exp( c − (1/2) bᵀA⁻¹b ) ] · [ (1 / ((2π)^{m/2} |Σ|^{1/2})) ∫ exp( −(1/2) (x − µ)ᵀΣ⁻¹(x − µ) ) dx ].

However, the term in brackets is identical in form to the integral of a multivariate Gaussian! Since we know that a Gaussian density normalizes, it follows that the term in brackets is equal to 1. Therefore,

I(A, b, c) = (2π)^{m/2} |A⁻¹|^{1/2} / exp( c − (1/2) bᵀA⁻¹b ).

Appendix A.2

We derive the form of the distribution of xA given xB; the other result follows immediately by symmetry. Note that

p(xA | xB) = (1 / ∫ p(xA, xB; µ, Σ) dxA) · (1 / ((2π)^{m/2} |Σ|^{1/2})) exp( −(1/2) (x − µ)ᵀΣ⁻¹(x − µ) )
           = (1/Z1) exp( −(1/2) ([xA; xB] − [µA; µB])ᵀ [VAA, VAB; VBA, VBB] ([xA; xB] − [µA; µB]) )

where Z1 is a proportionality constant which does not depend on xA, and

Σ⁻¹ = V = [ VAA, VAB ; VBA, VBB ].

To simplify this expression, observe that

([xA; xB] − [µA; µB])ᵀ [VAA, VAB; VBA, VBB] ([xA; xB] − [µA; µB])
  = (xA − µA)ᵀVAA(xA − µA) + (xA − µA)ᵀVAB(xB − µB) + (xB − µB)ᵀVBA(xA − µA) + (xB − µB)ᵀVBB(xB − µB).

Retaining only terms dependent on xA (and using the fact that VAB = VBAᵀ), we have

p(xA | xB) = (1/Z2) exp( −(1/2) [ xAᵀVAA xA − 2xAᵀVAA µA + 2xAᵀVAB(xB − µB) ] )

where Z2 is a new proportionality constant which again does not depend on xA.
Finally, using the "completion-of-squares" argument (see Appendix A.1), we have

p(xA | xB) = (1/Z3) exp( −(1/2) (xA − µ′)ᵀ VAA (xA − µ′) )

where Z3 is again a new proportionality constant not depending on xA, and where µ′ = µA − VAA⁻¹VAB(xB − µB). This last statement shows that the distribution of xA, conditioned on xB, again has the form of a multivariate Gaussian. In fact, from the normalization property, it follows immediately that

xA | xB ∼ N( µA − VAA⁻¹VAB(xB − µB),  VAA⁻¹ ).

To complete the proof, we simply note that

[ VAA, VAB ; VBA, VBB ] = [ (ΣAA − ΣABΣBB⁻¹ΣBA)⁻¹,  −(ΣAA − ΣABΣBB⁻¹ΣBA)⁻¹ΣABΣBB⁻¹ ;  −ΣBB⁻¹ΣBA(ΣAA − ΣABΣBB⁻¹ΣBA)⁻¹,  (ΣBB − ΣBAΣAA⁻¹ΣAB)⁻¹ ]

follows from standard formulas for the inverse of a partitioned matrix. Substituting the relevant blocks into the previous expression gives the desired result.
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-hmm.txt b/Lectures/aimlcs229/cs229-hmm.txt
new file mode 100644
index 0000000..1e082a4
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-hmm.txt
@@ -0,0 +1,1683 @@
Hidden Markov Models Fundamentals
Daniel Ramage
CS229 Section Notes
December 1, 2007

Abstract

How can we apply machine learning to data that is represented as a sequence of observations over time? For instance, we might be interested in discovering the sequence of words that someone spoke based on an audio recording of their speech. Or we might be interested in annotating a sequence of words with their part-of-speech tags. These notes provide a thorough mathematical introduction to the concept of Markov Models — a formalism for reasoning about states over time — and Hidden Markov Models — where we wish to recover a series of states from a series of observations. The final section includes some pointers to resources that present this material from other perspectives.

1 Markov Models

Given a set of states S = {s1, s2, . . . , s|S|} we can observe a series over time z ∈ S^T. For example, we might have the states from a weather system S = {sun, cloud, rain} with |S| = 3 and observe the weather over a few days {z1 = s_sun, z2 = s_cloud, z3 = s_cloud, z4 = s_rain, z5 = s_cloud} with T = 5.

The observed states of our weather example represent the output of a random process over time. Without some further assumptions, state sj at time t could be a function of any number of variables, including all the states from times 1 to t − 1 and possibly many others that we don't even model. However, we will make two Markov assumptions that will allow us to tractably reason about time series.

The limited horizon assumption is that the probability of being in a state at time t depends only on the state at time t − 1. The intuition underlying this assumption is that the state at time t represents "enough" summary of the past to reasonably predict the future. Formally:

P(zt | zt−1, zt−2, . . . , z1) = P(zt | zt−1)

The stationary process assumption is that the conditional distribution over the next state given the current state does not change over time. Formally:

P(zt | zt−1) = P(z2 | z1),  t ∈ 2 . . . T

As a convention, we will also assume that there is an initial state and initial observation z0 ≡ s0, where s0 represents the initial probability distribution over states at time 0. This notational convenience allows us to encode our belief about the prior probability of seeing the first real state z1 as P(z1 | z0). Note that P(zt | zt−1, . . . , z1) = P(zt | zt−1, . . . , z1, z0) because we've defined z0 = s0 for any state sequence. (Other presentations of HMMs sometimes represent these prior beliefs with a vector π ∈ R^|S|.)
We parametrize these transitions by defining a state transition matrix A ∈ R^{(|S|+1)×(|S|+1)}. The value Aij is the probability of transitioning from state i to state j at any time t. For our sun and rain example, we might have the following transition matrix:

                s0    s_sun   s_cloud   s_rain
A =  s0        [ 0     .33      .33      .33 ]
     s_sun     [ 0     .8       .1       .1  ]
     s_cloud   [ 0     .2       .6       .2  ]
     s_rain    [ 0     .1       .2       .7  ]

Note that these numbers (which I made up) represent the intuition that the weather is self-correlated: if it's sunny it will tend to stay sunny, cloudy will stay cloudy, etc. This pattern is common in many Markov models and can be observed as a strong diagonal in the transition matrix. Note that in this example, our initial state s0 shows uniform probability of transitioning to each of the three states in our weather system.

1.1 Two questions of a Markov Model

Combining the Markov assumptions with our state transition parametrization A, we can answer two basic questions about a sequence of states in a Markov chain. What is the probability of a particular sequence of states z? And how do we estimate the parameters of our model A so as to maximize the likelihood of an observed sequence z?

1.1.1 Probability of a state sequence

We can compute the probability of a particular series of states z by use of the chain rule of probability:

P(z) = P(zt, zt−1, . . . , z1; A)
     = P(zt, zt−1, . . . , z1, z0; A)
     = P(zt | zt−1, zt−2, . . . , z1; A) P(zt−1 | zt−2, . . . , z1; A) · · · P(z1 | z0; A)
     = P(zt | zt−1; A) P(zt−1 | zt−2; A) · · · P(z2 | z1; A) P(z1 | z0; A)
     = ∏_{t=1}^{T} P(zt | zt−1; A)
     = ∏_{t=1}^{T} A_{zt−1 zt}

In the second line we introduce z0 into our joint probability, which is allowed by the definition of z0 above. The third line is true of any joint distribution by the chain rule of probabilities or repeated application of Bayes rule. The fourth line follows from the Markov assumptions and the last line represents these terms as their elements in our transition matrix A.

Let's compute the probability of an example time sequence. We want P(z1 = s_sun, z2 = s_cloud, z3 = s_rain, z4 = s_rain, z5 = s_cloud), which can be factored as P(s_sun | s0) P(s_cloud | s_sun) P(s_rain | s_cloud) P(s_rain | s_rain) P(s_cloud | s_rain) = .33 × .1 × .2 × .7 × .2.
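As a quick check of the arithmetic above, here is a minimal Octave sketch that evaluates the product ∏ A_{zt−1 zt} for this sequence. The integer state encoding (1 = s0, 2 = sun, 3 = cloud, 4 = rain) is a choice made here for illustration.

% Transition matrix from the example; row/column order: s0, sun, cloud, rain.
A = [0 .33 .33 .33;
     0 .80 .10 .10;
     0 .20 .60 .20;
     0 .10 .20 .70];
z = [1 2 3 4 4 3];              % s0, sun, cloud, rain, rain, cloud
p = 1;
for t = 2:length(z)
  p = p * A(z(t-1), z(t));      % multiply A_{z_{t-1} z_t} along the chain
end
% p equals .33 * .1 * .2 * .7 * .2 = 9.24e-4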
1.1.2 Maximum likelihood parameter assignment

From a learning perspective, we could seek to find the parameters A that maximize the log-likelihood of a sequence of observations z. This corresponds to finding the likelihoods of transitioning from sunny to cloudy versus sunny to sunny, etc., that make a set of observations most likely. Let's define the log-likelihood of a Markov model.

l(A) = log P(z; A)
     = log ∏_{t=1}^{T} A_{zt−1 zt}
     = Σ_{t=1}^{T} log A_{zt−1 zt}
     = Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} log Aij

In the last line, we use an indicator function whose value is one when the condition holds and zero otherwise to select the observed transition at each time step. When solving this optimization problem, it's important to ensure that the solved parameters A still make a valid transition matrix. In particular, we need to enforce that the outgoing probability distribution from state i always sums to 1 and all elements of A are non-negative. We can solve this optimization problem using the method of Lagrange multipliers.

max_A  l(A)
s.t.   Σ_{j=1}^{|S|} Aij = 1,  i = 1..|S|
       Aij ≥ 0,  i, j = 1..|S|

This constrained optimization problem can be solved in closed form using the method of Lagrange multipliers. We'll introduce the equality constraint into the Lagrangian, but the inequality constraint can safely be ignored — the optimal solution will produce positive values for Aij anyway. Therefore we construct the Lagrangian as:

L(A, α) = Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} log Aij + Σ_{i=1}^{|S|} αi (1 − Σ_{j=1}^{|S|} Aij)

Taking partial derivatives and setting them equal to zero we get:

∂L(A, α)/∂Aij = (1/Aij) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} − αi ≡ 0
⇒ Aij = (1/αi) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj}

Substituting back in and setting the partial with respect to α equal to zero:

∂L(A, α)/∂αi = 1 − Σ_{j=1}^{|S|} Aij = 1 − Σ_{j=1}^{|S|} (1/αi) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} ≡ 0
⇒ αi = Σ_{j=1}^{|S|} Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} = Σ_{t=1}^{T} 1{zt−1 = si}

Substituting in this value for αi into the expression we derived for Aij, we obtain our final maximum likelihood parameter value for Âij:

Âij = Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} / Σ_{t=1}^{T} 1{zt−1 = si}

This formula encodes a simple intuition: the maximum likelihood probability of transitioning from state i to state j is just the number of times we transition from i to j divided by the total number of times we are in i. In other words, the maximum likelihood parameter corresponds to the fraction of the time when we were in state i that we transitioned to j.
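The counting formula amounts to a few lines of code. Here is a minimal Octave sketch, reusing the integer state encoding from the earlier snippet (again an illustrative choice, not part of the notes):

% Maximum likelihood estimate of A from one observed state sequence z,
% where z(1) is the initial state s0 and nS counts all states including s0.
function A = mle_transitions(z, nS)
  A = zeros(nS, nS);
  for t = 2:length(z)
    A(z(t-1), z(t)) = A(z(t-1), z(t)) + 1;   % count transitions i -> j
  end
  for i = 1:nS
    rowsum = sum(A(i, :));
    if rowsum > 0
      A(i, :) = A(i, :) / rowsum;            % normalize each visited row
    end
  end
end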
2 Hidden Markov Models

Markov Models are a powerful abstraction for time series data, but fail to capture a very common scenario. How can we reason about a series of states if we cannot observe the states themselves, but rather only some probabilistic function of those states? This is the scenario for part-of-speech tagging, where the words are observed but the part-of-speech tags aren't, and for speech recognition, where the sound sequence is observed but not the words that generated it. For a simple example, let's borrow the setup proposed by Jason Eisner in 2002 [1], "Ice Cream Climatology":

  The situation: You are a climatologist in the year 2799, studying the history of global warming. You can't find any records of Baltimore weather, but you do find my (Jason Eisner's) diary, in which I assiduously recorded how much ice cream I ate each day. What can you figure out from this about the weather that summer?

A Hidden Markov Model (HMM) can be used to explore this scenario. We don't get to observe the actual sequence of states (the weather on each day). Rather, we can only observe some outcome generated by each state (how many ice creams were eaten that day).

Formally, an HMM is a Markov model for which we have a series of observed outputs x = {x1, x2, . . . , xT} drawn from an output alphabet V = {v1, v2, . . . , v|V|}, i.e. xt ∈ V, t = 1..T. As in the previous section, we also posit the existence of a series of states z = {z1, z2, . . . , zT} drawn from a state alphabet S = {s1, s2, . . . , s|S|}, zt ∈ S, t = 1..T, but in this scenario the values of the states are unobserved. The transition between states i and j will again be represented by the corresponding value in our state transition matrix Aij.

We also model the probability of generating an output observation as a function of our hidden state. To do so, we make the output independence assumption and define P(xt = vk | zt = sj) = P(xt = vk | x1, . . . , xT, z1, . . . , zT) = Bjk. The matrix B encodes the probability of our hidden state generating output vk given that the state at the corresponding time was sj.

Returning to the weather example, imagine that you have logs of ice cream consumption over a four day period:

x = {x1 = v3, x2 = v2, x3 = v1, x4 = v2}

where our alphabet just encodes the number of ice creams consumed, i.e. V = {v1 = 1 ice cream, v2 = 2 ice creams, v3 = 3 ice creams}. What questions can an HMM let us answer?

2.1 Three questions of a Hidden Markov Model

There are three fundamental questions we might ask of an HMM. What is the probability of an observed sequence (how likely were we to see 3, 2, 1, 2 ice creams consumed)? What is the most likely series of states to generate the observations (what was the weather for those four days)? And how can we learn values for the HMM's parameters A and B given some data?

2.2 Probability of an observed sequence: Forward procedure

In an HMM, we assume that our data was generated by the following process: posit the existence of a series of states z over the length of our time series. This state sequence is generated by a Markov model parametrized by a state transition matrix A. At each time step t, we select an output xt as a function of the state zt. Therefore, to get the probability of a sequence of observations, we need to add up the likelihood of the data x given every possible series of states.

P(x; A, B) = Σ_z P(x, z; A, B) = Σ_z P(x | z; A, B) P(z; A, B)

The formulas above are true for any probability distribution. However, the HMM assumptions allow us to simplify the expression further:

P(x; A, B) = Σ_z ( ∏_{t=1}^{T} P(xt | zt; B) ) ( ∏_{t=1}^{T} P(zt | zt−1; A) )
           = Σ_z ( ∏_{t=1}^{T} B_{zt xt} ) ( ∏_{t=1}^{T} A_{zt−1 zt} )

The good news is that this is a simple expression in terms of our parameters. The derivation follows the HMM assumptions: the output independence assumption, Markov assumption, and stationary process assumption are all used to derive the second line. The bad news is that the sum is over every possible assignment to z. Because zt can take one of |S| possible values at each time step, evaluating this sum directly will require O(|S|^T) operations.
Fortunately, a faster means of computing P(x; A, B) is possible via a dynamic programming algorithm called the Forward Procedure. First, let's define a quantity αi(t) = P(x1, x2, . . . , xt, zt = si; A, B). αi(t) represents the total probability of all the observations up through time t (by any state assignment) and that we are in state si at time t. If we had such a quantity, the probability of our full set of observations P(x) could be represented as:

P(x; A, B) = P(x1, x2, . . . , xT; A, B)
           = Σ_{i=1}^{|S|} P(x1, x2, . . . , xT, zT = si; A, B)
           = Σ_{i=1}^{|S|} αi(T)

Algorithm 1 Forward Procedure for computing αi(t)
  1. Base case: αi(0) = A_{0 i},  i = 1..|S|
  2. Recursion: αj(t) = Σ_{i=1}^{|S|} αi(t − 1) Aij B_{j xt},  j = 1..|S|, t = 1..T

Algorithm 1 presents an efficient way to compute αi(t). At each time step we must do only O(|S|) operations for each of the |S| states, resulting in a final algorithm complexity of O(|S|² · T) to compute the total probability of an observed output sequence P(x; A, B).

A similar algorithm known as the Backward Procedure can be used to compute an analogous probability βi(t) = P(xT, xT−1, . . . , xt+1, zt = si; A, B).

2.3 Maximum Likelihood State Assignment: The Viterbi Algorithm

One of the most common queries of a Hidden Markov Model is to ask what was the most likely series of states z ∈ S^T given an observed series of outputs x ∈ V^T. Formally, we seek:

arg max_z P(z | x; A, B) = arg max_z P(x, z; A, B) / Σ_z P(x, z; A, B) = arg max_z P(x, z; A, B)

The first simplification follows from Bayes rule and the second from the observation that the denominator does not directly depend on z. Naively, we might try every possible assignment to z and take the one with the highest joint probability assigned by our model. However, this would require O(|S|^T) operations just to enumerate the set of possible assignments. At this point, you might think a dynamic programming solution like the Forward Algorithm might save the day, and you'd be right. Notice that if you replaced the Σ_z with arg max_z, our current task is exactly analogous to the expression which motivated the forward procedure.

The Viterbi Algorithm is just like the forward procedure except that instead of tracking the total probability of generating the observations seen so far, we need only track the maximum probability and record its corresponding state sequence.
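Below is a minimal Octave sketch of the forward recursion, folding the base case into the emission at the first time step. A and B are assumed to follow the indexing conventions above (row/column 1 of A is the initial state s0, which emits nothing); the max-product change that turns this into Viterbi is noted in a comment.

% Forward procedure. A is (|S|+1)x(|S|+1) with s0 in row/column 1;
% B(j,k) = P(x_t = v_k | z_t = s_j); x is a vector of output indices.
% Returns alpha(i,t) = P(x_1..x_t, z_t = s_i).
function alpha = forward_hmm(A, B, x)
  T = length(x);
  nS = rows(A) - 1;                            % number of "real" states
  alpha = zeros(nS, T);
  alpha(:, 1) = A(1, 2:end)' .* B(:, x(1));    % start from s0, emit x_1
  for t = 2:T
    for j = 1:nS
      alpha(j, t) = (alpha(:, t-1)' * A(2:end, j+1)) * B(j, x(t));
      % Viterbi: replace the sum above with
      %   max(alpha(:, t-1) .* A(2:end, j+1)) * B(j, x(t))
      % and record the argmax to recover the most likely state sequence.
    end
  end
  % P(x; A, B) = sum(alpha(:, end))
end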
2.4 Parameter Learning: EM for HMMs

The final question to ask of an HMM is: given a set of observations, what are the values of the state transition probabilities A and the output emission probabilities B that make the data most likely? For example, solving for the maximum likelihood parameters based on a speech recognition dataset will allow us to effectively train the HMM before asking for the maximum likelihood state assignment of a candidate speech signal.

In this section, we present a derivation of the Expectation Maximization algorithm for Hidden Markov Models. This proof follows from the general formulation of EM presented in the CS229 lecture notes. Algorithm 2 shows the basic EM algorithm. Notice that the optimization problem in the M-Step is now constrained such that A and B contain valid probabilities. Like the maximum likelihood solution we found for (non-Hidden) Markov models, we'll be able to solve this optimization problem with Lagrange multipliers. Notice also that the E-Step and M-Step both require enumerating all |S|^T possible labelings of z. We'll make use of the Forward and Backward algorithms mentioned earlier to compute a set of sufficient statistics for our E-Step and M-Step tractably.

Algorithm 2 Naive application of EM to HMMs
  Repeat until convergence {
    (E-Step) For every possible labeling z ∈ S^T, set
      Q(z) := p(z | x; A, B)
    (M-Step) Set
      A, B := arg max_{A,B} Σ_z Q(z) log( P(x, z; A, B) / Q(z) )
      s.t. Σ_{j=1}^{|S|} Aij = 1, i = 1..|S|;  Aij ≥ 0, i, j = 1..|S|
           Σ_{k=1}^{|V|} Bik = 1, i = 1..|S|;  Bik ≥ 0, i = 1..|S|, k = 1..|V|
  }

First, let's rewrite the objective function using our Markov assumptions.

A, B = arg max_{A,B} Σ_z Q(z) log( P(x, z; A, B) / Q(z) )
     = arg max_{A,B} Σ_z Q(z) log P(x, z; A, B)
     = arg max_{A,B} Σ_z Q(z) log( ( ∏_{t=1}^{T} P(xt | zt; B) ) ( ∏_{t=1}^{T} P(zt | zt−1; A) ) )
     = arg max_{A,B} Σ_z Q(z) Σ_{t=1}^{T} ( log B_{zt xt} + log A_{zt−1 zt} )
     = arg max_{A,B} Σ_z Q(z) Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{k=1}^{|V|} Σ_{t=1}^{T} ( 1{zt = sj ∧ xt = vk} log Bjk + 1{zt−1 = si ∧ zt = sj} log Aij )

In the first line we split the log division into a subtraction and note that the denominator's term does not depend on the parameters A, B. The Markov assumptions are applied in line 3. Line 5 uses indicator functions to index A and B by state.

Just as for the maximum likelihood parameters for a visible Markov model, it is safe to ignore the inequality constraints because the solution form naturally results in only positive solutions. Constructing the Lagrangian:

L(A, B, δ, ϵ) = Σ_z Q(z) Σ_{i=1}^{|S|} Σ_{j=1}^{|S|} Σ_{k=1}^{|V|} Σ_{t=1}^{T} ( 1{zt = sj ∧ xt = vk} log Bjk + 1{zt−1 = si ∧ zt = sj} log Aij )
              + Σ_{j=1}^{|S|} ϵj (1 − Σ_{k=1}^{|V|} Bjk) + Σ_{i=1}^{|S|} δi (1 − Σ_{j=1}^{|S|} Aij)

Taking partial derivatives and setting them equal to zero:

∂L(A, B, δ, ϵ)/∂Aij = (1/Aij) Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} − δi ≡ 0
⇒ Aij = (1/δi) Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj}

∂L(A, B, δ, ϵ)/∂Bjk = (1/Bjk) Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj ∧ xt = vk} − ϵj ≡ 0
⇒ Bjk = (1/ϵj) Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj ∧ xt = vk}
Taking partial derivatives with respect to the Lagrange multipliers and substituting our values of Aij and Bjk above:

∂L(A, B, δ, ϵ)/∂δi = 1 − Σ_{j=1}^{|S|} Aij = 1 − Σ_{j=1}^{|S|} (1/δi) Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} ≡ 0
⇒ δi = Σ_{j=1}^{|S|} Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} = Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si}

∂L(A, B, δ, ϵ)/∂ϵj = 1 − Σ_{k=1}^{|V|} Bjk = 1 − Σ_{k=1}^{|V|} (1/ϵj) Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj ∧ xt = vk} ≡ 0
⇒ ϵj = Σ_{k=1}^{|V|} Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj ∧ xt = vk} = Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj}

Substituting back into our expressions above, we find that the parameters Â and B̂ that maximize our predicted counts with respect to the dataset are:

Âij = Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj} / Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si}
B̂jk = Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj ∧ xt = vk} / Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj}

Unfortunately, each of these sums is over all possible labelings z ∈ S^T. But recall that Q(z) was defined in the E-step as P(z | x; A, B) for parameters A and B at the last time step. Let's consider how to represent first the numerator of Âij in terms of our forward and backward probabilities, αi(t) and βj(t).

Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj}
  = Σ_{t=1}^{T} Σ_z 1{zt−1 = si ∧ zt = sj} Q(z)
  = Σ_{t=1}^{T} Σ_z 1{zt−1 = si ∧ zt = sj} P(z | x; A, B)
  = (1/P(x; A, B)) Σ_{t=1}^{T} Σ_z 1{zt−1 = si ∧ zt = sj} P(z, x; A, B)
  = (1/P(x; A, B)) Σ_{t=1}^{T} αi(t) Aij B_{j xt} βj(t + 1)

In the first two steps we rearrange terms and substitute in for our definition of Q. Then we use Bayes rule in deriving line four, followed by the definitions of α, β, A, and B, in line five. Similarly, the denominator can be represented by summing out over j the value of the numerator.

Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si} = Σ_{j=1}^{|S|} Σ_z Q(z) Σ_{t=1}^{T} 1{zt−1 = si ∧ zt = sj}
  = (1/P(x; A, B)) Σ_{j=1}^{|S|} Σ_{t=1}^{T} αi(t) Aij B_{j xt} βj(t + 1)

Combining these expressions, we can fully characterize our maximum likelihood state transitions Âij without needing to enumerate all possible labelings as:

Âij = Σ_{t=1}^{T} αi(t) Aij B_{j xt} βj(t + 1) / Σ_{j=1}^{|S|} Σ_{t=1}^{T} αi(t) Aij B_{j xt} βj(t + 1)

Similarly, we can represent the numerator for B̂jk as:

Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj ∧ xt = vk}
  = (1/P(x; A, B)) Σ_{t=1}^{T} Σ_z 1{zt = sj ∧ xt = vk} P(z, x; A, B)
  = (1/P(x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^{T} Σ_z 1{zt−1 = si ∧ zt = sj ∧ xt = vk} P(z, x; A, B)
  = (1/P(x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^{T} 1{xt = vk} αi(t) Aij B_{j xt} βj(t + 1)

And the denominator of B̂jk as:

Σ_z Q(z) Σ_{t=1}^{T} 1{zt = sj}
  = (1/P(x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^{T} Σ_z 1{zt−1 = si ∧ zt = sj} P(z, x; A, B)
  = (1/P(x; A, B)) Σ_{i=1}^{|S|} Σ_{t=1}^{T} αi(t) Aij B_{j xt} βj(t + 1)

Combining these expressions, we have the following form for our maximum likelihood emission probabilities:

B̂jk = Σ_{i=1}^{|S|} Σ_{t=1}^{T} 1{xt = vk} αi(t) Aij B_{j xt} βj(t + 1) / Σ_{i=1}^{|S|} Σ_{t=1}^{T} αi(t) Aij B_{j xt} βj(t + 1)

Algorithm 3 Forward-Backward algorithm for HMM parameter learning
  Initialization: Set A and B as random valid probability matrices where Ai0 = 0 and B0k = 0 for i = 1..|S| and k = 1..|V|.
  Repeat until convergence {
    (E-Step) Run the Forward and Backward algorithms to compute αi and βi for i = 1..|S|. Then set:
      γt(i, j) := αi(t) Aij B_{j xt} βj(t + 1)
    (M-Step) Re-estimate the maximum likelihood parameters as:
      Aij := Σ_{t=1}^{T} γt(i, j) / Σ_{j=1}^{|S|} Σ_{t=1}^{T} γt(i, j)
      Bjk := Σ_{i=1}^{|S|} Σ_{t=1}^{T} 1{xt = vk} γt(i, j) / Σ_{i=1}^{|S|} Σ_{t=1}^{T} γt(i, j)
  }
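For concreteness, here is a compact Octave sketch of the E-step/M-step pair in the spirit of Algorithm 3. It uses the conventional time pairing for the pairwise statistic, ξt(i, j) = αi(t) Aij B_{j, x_{t+1}} βj(t + 1), with βj(t) the conditional backward variable P(x_{t+1}..x_T | zt = sj); the variable conventions and helper shapes are assumptions of this sketch, not definitions from the notes.

% One EM pass. alpha, beta are |S| x T matrices over the real states,
% A is the |S| x |S| transition block, B(j,k) emission probabilities,
% x the output index sequence.
T = length(x); nS = rows(A);
xi = zeros(nS, nS, T-1);
for t = 1:T-1
  xi(:,:,t) = (alpha(:,t) * (B(:,x(t+1)) .* beta(:,t+1))') .* A;
end
Anew = sum(xi, 3);
Anew = Anew ./ sum(Anew, 2);          % row-normalize transitions
post = alpha .* beta;                 % proportional to P(z_t = s_i | x)
post = post ./ sum(post, 1);
Bnew = zeros(nS, columns(B));
for k = 1:columns(B)
  Bnew(:,k) = sum(post(:, x == k), 2);
end
Bnew = Bnew ./ sum(Bnew, 2);          % row-normalize emissions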
Algorithm 3 shows a variant of the Forward-Backward Algorithm, or the Baum-Welch Algorithm, for parameter learning in HMMs. In the E-Step, rather than explicitly evaluating Q(z) for all z ∈ S^T, we compute a sufficient statistic γt(i, j) = αi(t) Aij B_{j xt} βj(t + 1) that is proportional to the probability of transitioning between state si and sj at time t given all of our observations x. The derived expressions for Aij and Bjk are intuitively appealing. Aij is computed as the expected number of transitions from si to sj divided by the expected number of appearances of si. Similarly, Bjk is computed as the expected number of emissions of vk from sj divided by the expected number of appearances of sj.

Like many applications of EM, parameter learning for HMMs is a non-convex problem with many local maxima. EM will converge to a maximum based on its initial parameters, so multiple runs might be in order. Also, it is often important to smooth the probability distributions represented by A and B so that no transition or emission is assigned 0 probability.

2.5 Further reading

There are many good sources for learning about Hidden Markov Models. For applications in NLP, I recommend consulting Jurafsky & Martin's draft second edition of Speech and Language Processing1 or Manning & Schütze's Foundations of Statistical Natural Language Processing. Also, Eisner's HMM-in-a-spreadsheet [1] is a light-weight interactive way to play with an HMM that requires only a spreadsheet application.

References

[1] Jason Eisner. An interactive spreadsheet for teaching the forward-backward algorithm. In Dragomir Radev and Chris Brew, editors, Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching NLP and CL, pages 10–18, 2002.

1 http://www.cs.colorado.edu/~martin/slp2.html
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-linalg.txt b/Lectures/aimlcs229/cs229-linalg.txt
new file mode 100644
index 0000000..0d68b23
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-linalg.txt
@@ -0,0 +1,1510 @@
Linear Algebra Review and Reference
Zico Kolter
October 16, 2007

1 Basic Concepts and Notation

Linear algebra provides a way of compactly representing and operating on sets of linear equations. For example, consider the following system of equations:

4x1 − 5x2 = −13
−2x1 + 3x2 = 9 .

This is two equations and two variables, so as you know from high school algebra, you can find a unique solution for x1 and x2 (unless the equations are somehow degenerate, for example if the second equation is simply a multiple of the first, but in the case above there is in fact a unique solution). In matrix notation, we can write the system more compactly as

Ax = b

with A = [ 4 −5 ; −2 3 ],  b = [ −13 ; 9 ].

As we will see shortly, there are many advantages (including the obvious space savings) to analyzing linear equations in this form.

1.1 Basic Notation

We use the following notation:

• By A ∈ R^{m×n} we denote a matrix with m rows and n columns, where the entries of A are real numbers.

• By x ∈ R^n, we denote a vector with n entries. Usually a vector x will denote a column vector — i.e., a matrix with n rows and 1 column. If we want to explicitly represent a row vector — a matrix with 1 row and n columns — we typically write xᵀ (here xᵀ denotes the transpose of x, which we will define shortly).

• The ith element of a vector x is denoted xi:  x = [x1; x2; . . . ; xn].

• We use the notation aij (or Aij, Ai,j, etc.) to denote the entry of A in the ith row and jth column.

• We denote the jth column of A by aj or A:,j.

• We denote the ith row of A by aiᵀ or Ai,:.

• Note that these definitions are ambiguous (for example, the a1 and a1ᵀ in the previous two definitions are not the same vector). Usually the meaning of the notation should be obvious from its use.
2 Matrix Multiplication

The product of two matrices A ∈ R^{m×n} and B ∈ R^{n×p} is the matrix

C = AB ∈ R^{m×p},

where

Cij = Σ_{k=1}^{n} Aik Bkj.

Note that in order for the matrix product to exist, the number of columns in A must equal the number of rows in B. There are many ways of looking at matrix multiplication, and we'll start by examining a few special cases.

2.1 Vector-Vector Products

Given two vectors x, y ∈ R^n, the quantity xᵀy, sometimes called the inner product or dot product of the vectors, is a real number given by

xᵀy ∈ R = Σ_{i=1}^{n} xi yi.

Note that it is always the case that xᵀy = yᵀx.

Given vectors x ∈ R^m, y ∈ R^n (they no longer have to be the same size), xyᵀ is called the outer product of the vectors. It is a matrix whose entries are given by (xyᵀ)ij = xi yj, i.e.,

xyᵀ ∈ R^{m×n} = [ x1y1 x1y2 · · · x1yn ; x2y1 x2y2 · · · x2yn ; . . . ; xmy1 xmy2 · · · xmyn ].

2.2 Matrix-Vector Products

Given a matrix A ∈ R^{m×n} and a vector x ∈ R^n, their product is a vector y = Ax ∈ R^m. There are a couple ways of looking at matrix-vector multiplication, and we will look at them both.

If we write A by rows, then we can express Ax as

y = [ a1ᵀx ; a2ᵀx ; . . . ; amᵀx ].

In other words, the ith entry of y is equal to the inner product of the ith row of A and x, yi = aiᵀx.

Alternatively, let's write A in column form. In this case we see that

y = [ a1 a2 · · · an ] x = a1 x1 + a2 x2 + . . . + an xn .

In other words, y is a linear combination of the columns of A, where the coefficients of the linear combination are given by the entries of x.

So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a row vector. This is written yᵀ = xᵀA for A ∈ R^{m×n}, x ∈ R^m, and y ∈ R^n. As before, we can express yᵀ in two obvious ways, depending on whether we express A in terms of its rows or columns. In the first case we express A in terms of its columns, which gives

yᵀ = xᵀ [ a1 a2 · · · an ] = [ xᵀa1 xᵀa2 · · · xᵀan ],

which demonstrates that the ith entry of yᵀ is equal to the inner product of x and the ith column of A.

Finally, expressing A in terms of rows we get the final representation of the vector-matrix product,

yᵀ = [x1 x2 · · · xm] [ a1ᵀ ; a2ᵀ ; . . . ; amᵀ ] = x1 a1ᵀ + x2 a2ᵀ + . . . + xm amᵀ,

so we see that yᵀ is a linear combination of the rows of A, where the coefficients for the linear combination are given by the entries of x.
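A two-line Octave check of the column view of Ax, on an arbitrary example matrix chosen here only for illustration:

A = [1 2; 3 4; 5 6];  x = [10; 1];
y1 = A * x;                          % built-in matrix-vector product
y2 = A(:,1) * x(1) + A(:,2) * x(2);  % same thing as a combination of columns
% y1 and y2 are both [12; 34; 56]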
2.3 Matrix-Matrix Products

Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the matrix-matrix multiplication C = AB as defined at the beginning of this section.

First, we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the i, j entry of C is equal to the inner product of the ith row of A and the jth column of B. Symbolically, this looks like the following,

C = AB = [ a1ᵀb1 a1ᵀb2 · · · a1ᵀbp ; a2ᵀb1 a2ᵀb2 · · · a2ᵀbp ; . . . ; amᵀb1 amᵀb2 · · · amᵀbp ].

Remember that since A ∈ R^{m×n} and B ∈ R^{n×p}, ai ∈ R^n and bj ∈ R^n, so these inner products all make sense. This is the most "natural" representation when we represent A by rows and B by columns. Alternatively, we can represent A by columns, and B by rows, which leads to the interpretation of AB as a sum of outer products. Symbolically,

C = AB = Σ_{i=1}^{n} ai biᵀ.

Put another way, AB is equal to the sum, over all i, of the outer product of the ith column of A and the ith row of B. Since, in this case, ai ∈ R^m and bi ∈ R^p, the dimension of the outer product ai biᵀ is m × p, which coincides with the dimension of C.

Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we represent B by columns, we can view the columns of C as matrix-vector products between A and the columns of B. Symbolically,

C = AB = A [ b1 b2 · · · bp ] = [ Ab1 Ab2 · · · Abp ].

Here the ith column of C is given by the matrix-vector product with the vector on the right, ci = Abi. These matrix-vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent A by rows, and view the rows of C as the matrix-vector product between the rows of A and B. Symbolically,

C = AB = [ a1ᵀB ; a2ᵀB ; . . . ; amᵀB ].

Here the ith row of C is given by the matrix-vector product with the vector on the left, ciᵀ = aiᵀB.

It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these viewpoints follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section. However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints presented here.

In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:

• Matrix multiplication is associative: (AB)C = A(BC).
• Matrix multiplication is distributive: A(B + C) = AB + AC.
• Matrix multiplication is, in general, not commutative; that is, it can be the case that AB ≠ BA.
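A quick Octave sanity check of the outer-product view and of non-commutativity, on small random matrices (the sizes are arbitrary):

A = randn(3, 2);  B = randn(2, 4);
C1 = A * B;
C2 = A(:,1) * B(1,:) + A(:,2) * B(2,:);   % sum of outer products
% norm(C1 - C2) is zero up to rounding error
P = randn(2);  Q = randn(2);
% P*Q and Q*P generally differ: norm(P*Q - Q*P) > 0 almost surely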
3 Operations and Properties

In this section we present several operations and properties of matrices and vectors. Hopefully a great deal of this will be review for you, so the notes can just serve as a reference for these topics.

3.1 The Identity Matrix and Diagonal Matrices

The identity matrix, denoted I ∈ R^{n×n}, is a square matrix with ones on the diagonal and zeros everywhere else. That is,

Iij = 1 if i = j,  Iij = 0 if i ≠ j.

It has the property that for all A ∈ R^{m×n},

AI = A = IA

where the size of I is determined by the dimensions of A so that matrix multiplication is possible.

A diagonal matrix is a matrix where all non-diagonal elements are 0. This is typically denoted D = diag(d1, d2, . . . , dn), with

Dij = di if i = j,  Dij = 0 if i ≠ j.

Clearly, I = diag(1, 1, . . . , 1).

3.2 The Transpose

The transpose of a matrix results from "flipping" the rows and columns. Given a matrix A ∈ R^{m×n}, its transpose, written Aᵀ, is defined as

Aᵀ ∈ R^{n×m},  (Aᵀ)ij = Aji.

We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector.

The following properties of transposes are easily verified:

• (Aᵀ)ᵀ = A
• (AB)ᵀ = BᵀAᵀ
• (A + B)ᵀ = Aᵀ + Bᵀ

3.3 Symmetric Matrices

A square matrix A ∈ R^{n×n} is symmetric if A = Aᵀ. It is anti-symmetric if A = −Aᵀ. It is easy to show that for any matrix A ∈ R^{n×n}, the matrix A + Aᵀ is symmetric and the matrix A − Aᵀ is anti-symmetric. From this it follows that any square matrix A ∈ R^{n×n} can be represented as a sum of a symmetric matrix and an anti-symmetric matrix, since

A = (1/2)(A + Aᵀ) + (1/2)(A − Aᵀ)

and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size n as S^n, so that A ∈ S^n means that A is a symmetric n × n matrix.

3.4 The Trace

The trace of a square matrix A ∈ R^{n×n}, denoted tr(A) (or just trA if the parentheses are obviously implied), is the sum of diagonal elements in the matrix:

trA = Σ_{i=1}^{n} Aii.

As described in the CS229 lecture notes, the trace has the following properties (included here for the sake of completeness):

• For A ∈ R^{n×n}, trA = trAᵀ.
• For A, B ∈ R^{n×n}, tr(A + B) = trA + trB.
• For A ∈ R^{n×n}, t ∈ R, tr(tA) = t trA.
• For A, B such that AB is square, trAB = trBA.
• For A, B, C such that ABC is square, trABC = trBCA = trCAB, and so on for the product of more matrices.
3.5 Norms

A norm of a vector x is informally a measure of the "length" of the vector. For example, we have the commonly-used Euclidean or ℓ2 norm,

||x||2 = sqrt( Σ_{i=1}^{n} xi² ).

Note that ||x||2² = xᵀx.

More formally, a norm is any function f : R^n → R that satisfies 4 properties:

1. For all x ∈ R^n, f(x) ≥ 0 (non-negativity).
2. f(x) = 0 if and only if x = 0 (definiteness).
3. For all x ∈ R^n, t ∈ R, f(tx) = |t| f(x) (homogeneity).
4. For all x, y ∈ R^n, f(x + y) ≤ f(x) + f(y) (triangle inequality).

Other examples of norms are the ℓ1 norm,

||x||1 = Σ_{i=1}^{n} |xi|

and the ℓ∞ norm,

||x||∞ = max_i |xi|.

In fact, all three norms presented so far are examples of the family of ℓp norms, which are parameterized by a real number p ≥ 1, and defined as

||x||p = ( Σ_{i=1}^{n} |xi|^p )^{1/p}.

Norms can also be defined for matrices, such as the Frobenius norm,

||A||F = sqrt( Σ_{i=1}^{m} Σ_{j=1}^{n} Aij² ) = sqrt( tr(AᵀA) ).

Many other norms exist, but they are beyond the scope of this review.

3.6 Linear Independence and Rank

A set of vectors {x1, x2, . . . , xn} is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, a vector which can be represented as a linear combination of the remaining vectors is said to be (linearly) dependent. For example, if

xn = Σ_{i=1}^{n−1} αi xi

for some {α1, . . . , αn−1} then xn is dependent on {x1, . . . , xn−1}; otherwise, it is independent of {x1, . . . , xn−1}.

The column rank of a matrix A is the largest number of columns of A that constitute a linearly independent set. This is often referred to simply as the number of linearly independent columns, but this terminology is a little sloppy, since it is possible that any vector in some set {x1, . . . , xn} can be expressed as a linear combination of the remaining vectors, even though some subset of the vectors might be independent. In the same way, the row rank is the largest number of rows of A that constitute a linearly independent set.

It is a basic fact of linear algebra that for any matrix A, columnrank(A) = rowrank(A), and so this quantity is simply referred to as the rank of A, denoted as rank(A). The following are some basic properties of the rank:

• For A ∈ R^{m×n}, rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank.
• For A ∈ R^{m×n}, rank(A) = rank(Aᵀ).
• For A ∈ R^{m×n}, B ∈ R^{n×p}, rank(AB) ≤ min(rank(A), rank(B)).
• For A, B ∈ R^{m×n}, rank(A + B) ≤ rank(A) + rank(B).

3.7 The Inverse

The inverse of a square matrix A ∈ R^{n×n} is denoted A⁻¹, and is the unique matrix such that

A⁻¹A = I = AA⁻¹.

It turns out that A⁻¹ may not exist for some matrices A; we say A is invertible or non-singular if A⁻¹ exists and non-invertible or singular otherwise. One condition for invertibility we already know: it is possible to show that A⁻¹ exists if and only if A is full rank. We will soon see that there are many alternative sufficient and necessary conditions, in addition to full rank, for invertibility. The following are properties of the inverse; all assume that A, B ∈ R^{n×n} are non-singular:

• (A⁻¹)⁻¹ = A
• If Ax = b, we can multiply by A⁻¹ on both sides to obtain x = A⁻¹b. This demonstrates the inverse with respect to the original system of linear equalities we began this review with.
• (AB)⁻¹ = B⁻¹A⁻¹
• (A⁻¹)ᵀ = (Aᵀ)⁻¹. For this reason this matrix is often denoted A⁻ᵀ.

3.8 Orthogonal Matrices

Two vectors x, y ∈ R^n are orthogonal if xᵀy = 0. A vector x ∈ R^n is normalized if ||x||2 = 1. A square matrix U ∈ R^{n×n} is orthogonal (note the different meanings when talking about vectors versus matrices) if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal).

It follows immediately from the definition of orthogonality and normality that

UᵀU = I = UUᵀ.

In other words, the inverse of an orthogonal matrix is its transpose. Note that if U is not square — i.e., U ∈ R^{m×n}, n < m — but its columns are still orthonormal, then UᵀU = I, but UUᵀ ≠ I. We generally only use the term orthogonal to describe the previous case, where U is square.

Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e.,

||Ux||2 = ||x||2

for any x ∈ R^n, U ∈ R^{n×n} orthogonal.
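A short Octave check of these two facts, building an orthogonal U from the QR decomposition of a random matrix (just one arbitrary way to obtain one):

[U, ~] = qr(randn(4));      % U is 4x4 orthogonal
x = randn(4, 1);
% U'*U and U*U' both equal eye(4) up to rounding error
% norm(U*x) equals norm(x) up to rounding error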
3.9 Range and Nullspace of a Matrix

The span of a set of vectors {x1, x2, . . . , xn} is the set of all vectors that can be expressed as a linear combination of {x1, . . . , xn}. That is,

span({x1, . . . , xn}) = { v : v = Σ_{i=1}^{n} αi xi,  αi ∈ R }.

It can be shown that if {x1, . . . , xn} is a set of n linearly independent vectors, where each xi ∈ R^n, then span({x1, . . . , xn}) = R^n. In other words, any vector v ∈ R^n can be written as a linear combination of x1 through xn. The projection of a vector y ∈ R^m onto the span of {x1, . . . , xn} (here we assume xi ∈ R^m) is the vector v ∈ span({x1, . . . , xn}) such that v is as close as possible to y, as measured by the Euclidean norm ||v − y||2. We denote the projection as Proj(y; {x1, . . . , xn}) and can define it formally as,

Proj(y; {x1, . . . , xn}) = argmin_{v ∈ span({x1, . . . , xn})} ||y − v||2.

The range (sometimes also called the columnspace) of a matrix A ∈ R^{m×n}, denoted R(A), is the span of the columns of A. In other words,

R(A) = {v ∈ R^m : v = Ax, x ∈ R^n}.

Making a few technical assumptions (namely that A is full rank and that n < m), the projection of a vector y ∈ R^m onto the range of A is given by,

Proj(y; A) = argmin_{v ∈ R(A)} ||v − y||2 = A(AᵀA)⁻¹Aᵀy.

This last equation should look extremely familiar, since it is almost the same formula we derived in class (and which we will soon derive again) for the least squares estimation of parameters. Looking at the definition for the projection, it should not be too hard to convince yourself that this is in fact the same objective that we minimized in our least squares problem (except for a squaring of the norm, which doesn't affect the optimal point) and so these problems are naturally very connected. When A contains only a single column, a ∈ R^m, this gives the special case for a projection of a vector onto a line:

Proj(y; a) = (aaᵀ / aᵀa) y.

The nullspace of a matrix A ∈ R^{m×n}, denoted N(A), is the set of all vectors that equal 0 when multiplied by A, i.e.,

N(A) = {x ∈ R^n : Ax = 0}.

Note that vectors in R(A) are of size m, while vectors in N(A) are of size n, so vectors in R(Aᵀ) and N(A) are both in R^n. In fact, we can say much more. It turns out that

{ w : w = u + v, u ∈ R(Aᵀ), v ∈ N(A) } = R^n   and   R(Aᵀ) ∩ N(A) = {0}.

In other words, R(Aᵀ) and N(A) together span the entire space of R^n and intersect only at the origin. Sets of this type are called orthogonal complements, and we denote this R(Aᵀ) = N(A)⊥.
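The projection formula in Octave, on an arbitrary tall full-rank matrix:

A = randn(5, 2);  y = randn(5, 1);
v = A * ((A' * A) \ (A' * y));   % Proj(y; A): closest point to y in R(A)
% Equivalent least-squares route: v2 = A * (A \ y);
% The residual y - v is orthogonal to R(A): A' * (y - v) is ~0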
+ +— + +of A, then the determinant of the new matrix + +— +—  + + = −|A| . + +— + +These properties, however, also give very little intuition about the nature of the determinant, so we now list several properties that follow from the three properties above: +• For A ∈ Rn×n , |A| = |AT |. +• For A, B ∈ Rn×n , |AB| = |A||B|. +• For A ∈ Rn×n , |A| = 0 if and only if A is singular (i.e., non-invertible). +• For A ∈ Rn×n and A non-singular, |A|−1 = 1/|A|. +Before given the general definition for the determinant, we define, for A ∈ Rn×n , A\i,\j ∈ +R(n−1)×(n−1) to be the matrix that results from deleting the ith row and jth column from A. +The general (recursive) formula for the determinant is +n + +|A| = + +(−1)i+j aij |A\i,\j | + +(for any j ∈ 1, . . . , n) + +(−1)i+j aij |A\i,\j | + +(for any i ∈ 1, . . . , n) + +i=1 +n + += +j=1 + +11 + + with the initial case that |A| = a11 for A ∈ R1×1 . If we were to expand this formula +completely for A ∈ Rn×n , there would be a total of n! (n factorial) different terms. For this +reason, we hardly even explicitly write the complete equation of the determinant for matrices +bigger than 3 × 3. However, the equations for determinants of matrices up to size 3 × 3 are +fairly common, and it is good to know them: + +a11 +a21 + + +a11 a12 + a21 a22 +a31 a32 + +|[a11 ]| = a11 +a12 += a11 a22 − a12 a21 +a22 + +a13 +a11 a22 a33 + a12 a23 a31 + a13 a21 a32 +a23  = +−a11 a23 a32 − a12 a21 a33 − a13 a22 a31 +a33 + +The classical adjoint (often just called the adjoint) of a matrix A ∈ Rn×n , is denoted +adj(A), and defined as +adj(A) ∈ Rn×n , (adj(A))ij = (−1)i+j |A\j,\i | +(note the switch in the indices A\j,\i ). It can be shown that for any nonsingular A ∈ Rn×n , +A−1 = + +1 +adj(A) . +|A| + +While this is a nice “explicit” formula for the inverse of matrix, we should note that, numerically, there are in fact much more efficient ways of computing the inverse. + +3.11 + +Quadratic Forms and Positive Semidefinite Matrices + +Given a matrix square A ∈ Rn×n and a vector x ∈ R, the scalar value xT Ax is called a +quadratic form. Written explicitly, we see that +n + +n + +T + +x Ax = + +Aij xi xj . +i=1 j=1 + +Note that, + +1 +1 +xT Ax = (xT Ax)T = xT AT x = xT ( A + AT )x +2 +2 +i.e., only the symmetric part of A contributes to the quadratic form. For this reason, we +often implicitly assume that the matrices appearing in a quadratic form are symmetric. +We give the following definitions: +• A symmetric matrix A ∈ Sn is positive definite (PD) if for all non-zero vectors +x ∈ Rn , xT Ax > 0. This is usually denoted A ≻ 0 (or just A > 0), and often times the +set of all positive definite matrices is denoted Sn++ . +12 + + • A symmetric matrix A ∈ Sn is position semidefinite (PSD) if for all vectors xT Ax ≥ +0. This is written A 0 (or just A ≥ 0), and the set of all positive semidefinite matrices +is often denoted Sn+ . +• Likewise, a symmetric matrix A ∈ Sn is negative definite (ND), denoted A ≺ 0 (or +just A < 0) if for all non-zero x ∈ Rn , xT Ax < 0. +• Similarly, a symmetric matrix A ∈ Sn is negative semidefinite (NSD), denoted +A 0 (or just A ≤ 0) if for all x ∈ Rn , xT Ax ≤ 0. +• Finally, a symmetric matrix A ∈ Sn is indefinite, if it is neither positive semidefinite +nor negative semidefinite — i.e., if there exists x1 , x2 ∈ Rn such that xT1 Ax1 > 0 and +xT2 Ax2 < 0. +It should be obvious that if A is positive definite, then −A is negative definite and vice +versa. Likewise, if A is positive semidefinite then −A is negative semidefinite and vice versa. 
3.12  Eigenvalues and Eigenvectors

Given a square matrix $A \in \mathbb{R}^{n \times n}$, we say that $\lambda \in \mathbb{C}$ is an eigenvalue of $A$ and $x \in \mathbb{C}^n$ is the corresponding eigenvector¹ if

$$Ax = \lambda x, \quad x \ne 0.$$

Intuitively, this definition means that multiplying $A$ by the vector $x$ results in a new vector that points in the same direction as $x$, but is scaled by a factor $\lambda$. Also note that for any eigenvector $x \in \mathbb{C}^n$ and scalar $c \in \mathbb{C}$, $A(cx) = cAx = c\lambda x = \lambda(cx)$, so $cx$ is also an eigenvector. For this reason, when we talk about "the" eigenvector associated with $\lambda$, we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since $x$ and $-x$ will both be eigenvectors, but we will have to live with this).

We can rewrite the equation above to state that $(\lambda, x)$ is an eigenvalue-eigenvector pair of $A$ if

$$(\lambda I - A) x = 0, \quad x \ne 0.$$

But $(\lambda I - A) x = 0$ has a non-zero solution $x$ if and only if $(\lambda I - A)$ has a non-trivial nullspace, which is only the case if $(\lambda I - A)$ is singular, i.e.,

$$|(\lambda I - A)| = 0.$$

We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial in $\lambda$, where $\lambda$ will have maximum degree $n$. We then find the $n$ (possibly complex) roots of this polynomial to find the $n$ eigenvalues $\lambda_1, \ldots, \lambda_n$. To find the eigenvector corresponding to the eigenvalue $\lambda_i$, we simply solve the linear equation $(\lambda_i I - A) x = 0$. It should be noted that this is not the method actually used in practice to numerically compute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has $n!$ terms); it is rather a mathematical argument.

The following are properties of eigenvalues and eigenvectors (in all cases assume $A \in \mathbb{R}^{n \times n}$ has eigenvalues $\lambda_1, \ldots, \lambda_n$ and associated eigenvectors $x_1, \ldots, x_n$):

• The trace of $A$ is equal to the sum of its eigenvalues,
$$\mathrm{tr} A = \sum_{i=1}^n \lambda_i.$$
• The determinant of $A$ is equal to the product of its eigenvalues,
$$|A| = \prod_{i=1}^n \lambda_i.$$
• The rank of $A$ is equal to the number of non-zero eigenvalues of $A$.
• If $A$ is non-singular, then $1/\lambda_i$ is an eigenvalue of $A^{-1}$ with associated eigenvector $x_i$, i.e., $A^{-1} x_i = (1/\lambda_i) x_i$.
• The eigenvalues of a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$ are just the diagonal entries $d_1, \ldots, d_n$.

We can write all the eigenvector equations simultaneously as

$$AX = X\Lambda$$

where the columns of $X \in \mathbb{R}^{n \times n}$ are the eigenvectors of $A$ and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $A$, i.e.,

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}, \quad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n).$$

If the eigenvectors of $A$ are linearly independent, then the matrix $X$ will be invertible, so $A = X \Lambda X^{-1}$. A matrix that can be written in this form is called diagonalizable.

¹ Note that $\lambda$ and the entries of $x$ are actually in $\mathbb{C}$, the set of complex numbers, not just the reals; we will see shortly why this is necessary. Don't worry about this technicality for now; you can think of complex vectors in the same way as real vectors.
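The eigenvalue properties above are easy to check numerically (a minimal sketch of our own; np.linalg.eig returns complex results for a general real matrix, consistent with the footnote):

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((4, 4))
    lam, X = np.linalg.eig(A)                 # columns of X are unit-norm eigenvectors

    print(np.allclose(A @ X, X @ np.diag(lam)))       # AX = X Lambda
    print(np.isclose(np.trace(A), lam.sum()))         # tr A = sum of eigenvalues
    print(np.isclose(np.linalg.det(A), lam.prod()))   # |A| = product of eigenvalues
    # If the eigenvectors are independent, A = X Lambda X^{-1} (diagonalizable):
    print(np.allclose(A, X @ np.diag(lam) @ np.linalg.inv(X)))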
3.13  Eigenvalues and Eigenvectors of Symmetric Matrices

Two remarkable properties come about when we look at the eigenvalues and eigenvectors of a symmetric matrix $A \in \mathbb{S}^n$. First, it can be shown that all the eigenvalues of $A$ are real. Second, the eigenvectors of $A$ are orthonormal, i.e., the matrix $X$ defined above is an orthogonal matrix (for this reason, we denote the matrix of eigenvectors as $U$ in this case). We can therefore represent $A$ as $A = U \Lambda U^T$, remembering from above that the inverse of an orthogonal matrix is just its transpose.

Using this, we can show that the definiteness of a matrix depends entirely on the sign of its eigenvalues. Suppose $A \in \mathbb{S}^n$, $A = U \Lambda U^T$. Then

$$x^T A x = x^T U \Lambda U^T x = y^T \Lambda y = \sum_{i=1}^n \lambda_i y_i^2$$

where $y = U^T x$ (and since $U$ is full rank, any vector $y \in \mathbb{R}^n$ can be represented in this form). Because $y_i^2$ is always non-negative, the sign of this expression depends entirely on the $\lambda_i$'s. If all $\lambda_i > 0$, then the matrix is positive definite; if all $\lambda_i \ge 0$, it is positive semidefinite. Likewise, if all $\lambda_i < 0$ or all $\lambda_i \le 0$, then $A$ is negative definite or negative semidefinite, respectively. Finally, if $A$ has both positive and negative eigenvalues, it is indefinite.

An application where eigenvalues and eigenvectors come up frequently is in maximizing some function of a matrix. In particular, for a matrix $A \in \mathbb{S}^n$, consider the following maximization problem,

$$\max_{x \in \mathbb{R}^n} \ x^T A x \quad \text{subject to } \|x\|_2^2 = 1,$$

i.e., we want to find the vector (of norm 1) which maximizes the quadratic form. Assuming the eigenvalues are ordered as $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$, the optimal $x$ for this optimization problem is $x_1$, the eigenvector corresponding to $\lambda_1$. In this case the maximal value of the quadratic form is $\lambda_1$. Similarly, the optimal solution to the minimization problem,

$$\min_{x \in \mathbb{R}^n} \ x^T A x \quad \text{subject to } \|x\|_2^2 = 1,$$

is $x_n$, the eigenvector corresponding to $\lambda_n$, and the minimal value is $\lambda_n$. This can be proved by appealing to the eigenvector-eigenvalue form of $A$ and the properties of orthogonal matrices. However, in the next section we will see a way of showing it directly using matrix calculus.
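The symmetric case can be illustrated with a short sketch (our own; np.linalg.eigh exploits symmetry and returns real eigenvalues in ascending order, so the last one plays the role of lambda_1 above):

    import numpy as np

    rng = np.random.default_rng(4)
    B = rng.standard_normal((5, 5))
    A = (B + B.T) / 2                          # a random symmetric matrix
    lam, U = np.linalg.eigh(A)                 # real eigenvalues, U orthogonal

    print(np.allclose(A, U @ np.diag(lam) @ U.T))    # A = U Lambda U^T
    x_top = U[:, -1]                                 # eigenvector for the largest eigenvalue
    print(np.isclose(x_top @ A @ x_top, lam[-1]))    # attains the maximal quadratic form
    xs = rng.standard_normal((1000, 5))
    xs /= np.linalg.norm(xs, axis=1, keepdims=True)  # random unit vectors
    print(np.max(np.einsum("bi,ij,bj->b", xs, A, xs)) <= lam[-1] + 1e-12)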
4  Matrix Calculus

While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a few examples.

4.1  The Gradient

Suppose that $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ is a function that takes as input a matrix $A$ of size $m \times n$ and returns a real value. Then the gradient of $f$ (with respect to $A \in \mathbb{R}^{m \times n}$) is the matrix of partial derivatives, defined as

$$\nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \frac{\partial f(A)}{\partial A_{12}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \frac{\partial f(A)}{\partial A_{21}} & \frac{\partial f(A)}{\partial A_{22}} & \cdots & \frac{\partial f(A)}{\partial A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \frac{\partial f(A)}{\partial A_{m2}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix},$$

i.e., an $m \times n$ matrix with

$$(\nabla_A f(A))_{ij} = \frac{\partial f(A)}{\partial A_{ij}}.$$

Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. So if, in particular, $A$ is just a vector $x \in \mathbb{R}^n$,

$$\nabla_x f(x) = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}.$$

It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if it returns a scalar value. We cannot, for example, take the gradient of $Ax$, $A \in \mathbb{R}^{n \times n}$, with respect to $x$, since this quantity is vector-valued.

It follows directly from the equivalent properties of partial derivatives that:

• $\nabla_x (f(x) + g(x)) = \nabla_x f(x) + \nabla_x g(x)$.
• For $t \in \mathbb{R}$, $\nabla_x (t f(x)) = t \nabla_x f(x)$.

It is a little bit trickier to determine what the proper expression is for $\nabla_x f(Ax)$, $A \in \mathbb{R}^{n \times n}$, but this is doable as well (in fact, you'll have to work this out for a homework problem).

4.2  The Hessian

Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is a function that takes a vector in $\mathbb{R}^n$ and returns a real number. Then the Hessian matrix with respect to $x$, written $\nabla_x^2 f(x)$ or simply as $H$, is the $n \times n$ matrix of partial derivatives,

$$\nabla_x^2 f(x) \in \mathbb{R}^{n \times n} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}.$$

In other words, $\nabla_x^2 f(x) \in \mathbb{R}^{n \times n}$, with

$$(\nabla_x^2 f(x))_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}.$$

Note that the Hessian is always symmetric, since

$$\frac{\partial^2 f(x)}{\partial x_i \partial x_j} = \frac{\partial^2 f(x)}{\partial x_j \partial x_i}.$$

Similar to the gradient, the Hessian is defined only when $f(x)$ is real-valued.

It is natural to think of the gradient as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative (and the symbols we use also suggest this relation). This intuition is generally correct, but there are a few caveats to keep in mind.

First, for real-valued functions of one variable $f : \mathbb{R} \to \mathbb{R}$, it is a basic definition that the second derivative is the derivative of the first derivative, i.e.,

$$\frac{\partial^2 f(x)}{\partial x^2} = \frac{\partial}{\partial x} \frac{\partial}{\partial x} f(x).$$

However, for functions of a vector, the gradient of the function is a vector, and we cannot take the gradient of a vector; that is,

$$\nabla_x \nabla_x f(x) = \nabla_x \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} \\ \frac{\partial f(x)}{\partial x_2} \\ \vdots \\ \frac{\partial f(x)}{\partial x_n} \end{bmatrix}$$

and this expression is not defined. Therefore, it is not the case that the Hessian is the gradient of the gradient. However, this is almost true, in the following sense: if we look at the $i$th entry of the gradient, $(\nabla_x f(x))_i = \partial f(x)/\partial x_i$, and take the gradient with respect to $x$, we get

$$\nabla_x \frac{\partial f(x)}{\partial x_i} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_i \partial x_1} \\ \frac{\partial^2 f(x)}{\partial x_i \partial x_2} \\ \vdots \\ \frac{\partial^2 f(x)}{\partial x_i \partial x_n} \end{bmatrix},$$

which is the $i$th column (or row) of the Hessian. Therefore,

$$\nabla_x^2 f(x) = \begin{bmatrix} \nabla_x (\nabla_x f(x))_1 & \nabla_x (\nabla_x f(x))_2 & \cdots & \nabla_x (\nabla_x f(x))_n \end{bmatrix}.$$

If we don't mind being a little bit sloppy, we can say that (essentially) $\nabla_x^2 f(x) = \nabla_x (\nabla_x f(x))^T$, so long as we understand that this really means taking the gradient of each entry of $(\nabla_x f(x))^T$, not the gradient of the whole vector.

Finally, note that while we can take the gradient with respect to a matrix $A \in \mathbb{R}^{m \times n}$, for the purposes of this class we will only consider taking the Hessian with respect to a vector $x \in \mathbb{R}^n$. This is simply a matter of convenience (and the fact that none of the calculations we do require us to find the Hessian with respect to a matrix), since the Hessian with respect to a matrix would have to represent all the partial derivatives $\partial^2 f(A) / (\partial A_{ij} \partial A_{k\ell})$, and it is rather cumbersome to represent this as a matrix.
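These definitions translate directly into finite-difference approximations (a minimal sketch of our own, useful for checking gradient derivations; the helper names and test function are invented):

    import numpy as np

    def num_gradient(f, x, eps=1e-6):
        # Entrywise central-difference approximation of the gradient definition above.
        g = np.zeros_like(x)
        for i in range(x.size):
            e = np.zeros_like(x); e[i] = eps
            g[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    def num_hessian(f, x, eps=1e-4):
        # Column i of the Hessian = gradient of the i-th entry of the gradient.
        H = np.zeros((x.size, x.size))
        for i in range(x.size):
            e = np.zeros_like(x); e[i] = eps
            H[:, i] = (num_gradient(f, x + e) - num_gradient(f, x - e)) / (2 * eps)
        return H

    f = lambda x: np.sin(x[0]) * x[1] ** 2
    x0 = np.array([0.3, -1.2])
    print(num_gradient(f, x0))    # ~ [cos(x_1) x_2^2, 2 sin(x_1) x_2]
    print(num_hessian(f, x0))     # symmetric, as noted above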
+ +4.3 + +Gradients and Hessians of Quadratic and Linear Functions + +Now let’s try to determine the gradient and Hessian matrices for a few simple functions. It +should be noted that all the gradients given here are special cases of the gradients given in +the CS229 lecture notes. +For x ∈ Rn , let f (x) = bT x for some known vector b ∈ Rn . Then +n + +f (x) = + +bi xi +i=1 + +so +∂f (x) +∂ += +∂xk +∂xk + +n + +bi xi = bk . +i=1 + +T + +From this we can easily see that ∇x b x = b. This should be compared to the analogous +situation in single variable calculus, where ∂/(∂x) ax = a. +Now consider the quadratic function f (x) = xT Ax for A ∈ Sn . Remember that +n + +n + +f (x) = + +Aij xi xj +i=1 j=1 + +so +∂f (x) +∂ += +∂xk +∂xk + +n + +n + +n + +Aij xi xj = +i=1 j=1 + +n + +Aik xi + +i=1 + +n + +Akj xj = 2 +j=1 + +Aki xi +i=1 + +where the last equality follows since A is symmetric (which we can safely assume, since it is +appearing in a quadratic form). Note that the kth entry of ∇x f (x) is just the inner product +of the kth row of A and x. Therefore, ∇x xT Ax = 2Ax. Again, this should remind you of +the analogous fact in single-variable calculus, that ∂/(∂x) ax2 = 2ax. +Finally, lets look at the Hessian of the quadratic function f (x) = xT Ax (it should be +obvious that the Hessian of a linear function bT x is zero). This is even easier than determining +the gradient of the function, since +∂ 2 f (x) +∂2 += +∂xk ∂xℓ +∂xk ∂xℓ + +n + +n + +Aij xi xj = Akℓ + Aℓk = 2Akℓ . +i=1 j=1 + +Therefore, it should be clear that ∇2x xT Ax = 2A, which should be entirely expected (and +again analogous to the single-variable fact that ∂ 2 /(∂x2 ) ax2 = 2a). +To recap, +18 + + • ∇x bT x = b +• ∇x xT Ax = 2Ax (if A symmetric) +• ∇2x xT Ax = 2A (if A symmetric) + +4.4 + +Least Squares + +Lets apply the equations we obtained in the last section to derive the least squares equations. +Suppose we are given matrices A ∈ Rm×n (for simplicity we assume A is full rank) and a +vector b ∈ Rm such that b ∈ R(A). In this situation we will not be able to find a vector +x ∈ Rn , such that Ax = b, so instead we want to find a vector x such that Ax is as close as +possible to b, as measured by the square of the Euclidean norm Ax − b 22 . +Using the fact that x 22 = xT x, we have +Ax − b + +2 +2 + += (Ax − b)T (Ax − b) += xT AT Ax − 2bT Ax + bT b + +Taking the gradient with respect to x we have, and using the properties we derived in the +previous section +∇x (xT AT Ax − 2bT Ax + bT b) = ∇x xT AT Ax − ∇x 2bT Ax + ∇x bT b += 2AT Ax − 2AT b +Setting this last expression equal to zero and solving for x gives the normal equations +x = (AT A)−1 AT b +which is the same as what we derived in class. + +4.5 + +Gradients of the Determinant + +Now lets consider a situation where we find the gradient of a function with respect to a matrix, +namely for A ∈ Rn×n , we want to find ∇A |A|. Recall from our discussion of determinants +that +n +(−1)i+j Aij |A\i,\j | + +|A| = + +(for any j ∈ 1, . . . , n) + +i=1 + +so +∂ +∂ +|A| = +∂Akℓ +∂Akℓ + +n + +(−1)i+j Aij |A\i,\j | = (−1)k+ℓ |A\k,\ℓ | = (adj(A))ℓk . +i=1 + +From this it immediately follows from the properties of the adjoint that +∇A |A| = (adj(A))T = |A|A−T . +19 + + Now lets consider the function f : Sn++ → R, f (A) = log |A|. Note that we have to +restrict the domain of f to be the positive definite matrices, since this ensures that |A| > 0, +so that the log of |A| is a real number. 
4.5  Gradients of the Determinant

Now let's consider a situation where we find the gradient of a function with respect to a matrix, namely for $A \in \mathbb{R}^{n \times n}$, we want to find $\nabla_A |A|$. Recall from our discussion of determinants that

$$|A| = \sum_{i=1}^n (-1)^{i+j} A_{ij} |A_{\backslash i, \backslash j}| \quad \text{(for any } j \in 1, \ldots, n\text{)},$$

so

$$\frac{\partial}{\partial A_{k\ell}} |A| = \frac{\partial}{\partial A_{k\ell}} \sum_{i=1}^n (-1)^{i+j} A_{ij} |A_{\backslash i, \backslash j}| = (-1)^{k+\ell} |A_{\backslash k, \backslash \ell}| = (\mathrm{adj}(A))_{\ell k}.$$

From this it immediately follows from the properties of the adjoint that

$$\nabla_A |A| = (\mathrm{adj}(A))^T = |A| A^{-T}.$$

Now let's consider the function $f : \mathbb{S}^n_{++} \to \mathbb{R}$, $f(A) = \log |A|$. Note that we have to restrict the domain of $f$ to be the positive definite matrices, since this ensures that $|A| > 0$, so that the log of $|A|$ is a real number. In this case we can use the chain rule (nothing fancy, just the ordinary chain rule from single-variable calculus) to see that

$$\frac{\partial \log |A|}{\partial A_{ij}} = \frac{\partial \log |A|}{\partial |A|} \frac{\partial |A|}{\partial A_{ij}} = \frac{1}{|A|} \frac{\partial |A|}{\partial A_{ij}}.$$

From this it should be obvious that

$$\nabla_A \log |A| = \frac{1}{|A|} \nabla_A |A| = A^{-1},$$

where we can drop the transpose in the last expression because $A$ is symmetric. Note the similarity to the single-variable case, where $\partial/(\partial x) \log x = 1/x$.

4.6  Eigenvalues as Optimization

Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following equality constrained optimization problem:

$$\max_{x \in \mathbb{R}^n} \ x^T A x \quad \text{subject to } \|x\|_2^2 = 1$$

for a symmetric matrix $A \in \mathbb{S}^n$. A standard way of solving optimization problems with equality constraints is by forming the Lagrangian, an objective function that includes the equality constraints.² The Lagrangian in this case can be given by

$$\mathcal{L}(x, \lambda) = x^T A x - \lambda x^T x,$$

where $\lambda$ is called the Lagrange multiplier associated with the equality constraint. It can be established that for $x^*$ to be an optimal point of the problem, the gradient of the Lagrangian has to be zero at $x^*$ (this is not the only condition, but it is required). That is,

$$\nabla_x \mathcal{L}(x, \lambda) = \nabla_x (x^T A x - \lambda x^T x) = 2 A^T x - 2 \lambda x = 0.$$

Notice that this is just the linear equation $Ax = \lambda x$. This shows that the only points which can possibly maximize (or minimize) $x^T A x$, assuming $x^T x = 1$, are the eigenvectors of $A$.

² Don't worry if you haven't seen Lagrangians before, as we will cover them in greater detail later in CS229.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes1.txt b/Lectures/aimlcs229/cs229-notes1.txt new file mode 100644 index 0000000..8fa653a --- /dev/null +++ b/Lectures/aimlcs229/cs229-notes1.txt @@ -0,0 +1,2031 @@

CS229 Lecture notes
Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

Living area (feet²) | Price (1000$s)
2104 | 400
1600 | 330
2400 | 369
1416 | 232
3000 | 540
... | ...

We can plot this data:

[Figure: scatter plot of housing prices; price (in $1000) versus square feet]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use $x^{(i)}$ to denote the "input" variables (living area in this example), also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of $m$ training examples $\{(x^{(i)}, y^{(i)}); i = 1, \ldots, m\}$, is called a training set. Note that the superscript "$(i)$" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ the space of output values. In this example, $\mathcal{X} = \mathcal{Y} = \mathbb{R}$.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : \mathcal{X} \to \mathcal{Y}$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis.
Seen pictorially, the process is therefore like this:

[Figure: flow diagram; a training set is fed to a learning algorithm, which produces a hypothesis h mapping x (the living area of a house) to a predicted y (the predicted price of the house)]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When $y$ can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

Part I
Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:

Living area (feet²) | #bedrooms | Price (1000$s)
2104 | 3 | 400
1600 | 3 | 330
2400 | 3 | 369
1416 | 2 | 232
3000 | 4 | 540
... | ... | ...

Here, the $x$'s are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the $i$-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses $h$ in a computer. As an initial choice, let's say we decide to approximate $y$ as a linear function of $x$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$ and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that

$$h(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x,$$

where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here $n$ is the number of input variables (not counting $x_0$).

Now, given a training set, how do we pick, or learn, the parameters $\theta$? One reasonable method seems to be to make $h(x)$ close to $y$, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the $\theta$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s. We define the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going, and we'll eventually show this to be a special case of a much broader family of algorithms.
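The cost function is straightforward to compute (a minimal sketch of our own, using the five rows tabulated above as toy data):

    import numpy as np

    # Design matrix for the five listed houses, with x0 = 1 as the intercept term.
    X = np.array([[1, 2104, 3],
                  [1, 1600, 3],
                  [1, 2400, 3],
                  [1, 1416, 2],
                  [1, 3000, 4]], dtype=float)
    y = np.array([400, 330, 369, 232, 540], dtype=float)

    def J(theta, X, y):
        # J(theta) = (1/2) * sum_i (h_theta(x^(i)) - y^(i))^2, with h_theta(x) = theta^T x.
        residuals = X @ theta - y
        return 0.5 * residuals @ residuals

    print(J(np.zeros(3), X, y))    # cost of the all-zeros hypothesis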
1  LMS algorithm

We want to choose $\theta$ so as to minimize $J(\theta)$. To do so, let's use a search algorithm that starts with some "initial guess" for $\theta$, and that repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, let's consider the gradient descent algorithm, which starts with some initial $\theta$, and repeatedly performs the update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta).$$

(This update is simultaneously performed for all values of $j = 0, \ldots, n$.) Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.

In order to implement this algorithm, we have to work out what the partial derivative term on the right hand side is. Let's first work it out for the case where we have only one training example $(x, y)$, so that we can neglect the sum in the definition of $J$. We have:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} (h_\theta(x) - y)^2$$
$$= 2 \cdot \frac{1}{2} (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} (h_\theta(x) - y)$$
$$= (h_\theta(x) - y) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^n \theta_i x_i - y \right)$$
$$= (h_\theta(x) - y) \, x_j$$

For a single training example, this gives the update rule:¹

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.$$

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$; thus, for instance, if we are encountering a training example on which our prediction nearly matches the actual value of $y^{(i)}$, then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).

We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm:

Repeat until convergence {
    $\theta_j := \theta_j + \alpha \sum_{i=1}^m \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$    (for every $j$).
}

The reader can easily verify that the quantity in the summation in the update rule above is just $\partial J(\theta) / \partial \theta_j$ (for the original definition of $J$). So, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum, and no other local optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function:

[Figure: elliptical contours of a quadratic function, with the trajectory taken by gradient descent marked by x's joined by straight lines]

The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x's in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through.

When we run batch gradient descent to fit $\theta$ on our previous dataset, to learn to predict housing price as a function of living area, we obtain $\theta_0 = 71.27$, $\theta_1 = 0.1345$.

¹ We use the notation "$a := b$" to denote an operation (in a computer program) in which we set the value of a variable $a$ to be equal to the value of $b$. In other words, this operation overwrites $a$ with the value of $b$. In contrast, we will write "$a = b$" when we are asserting a statement of fact, that the value of $a$ is equal to the value of $b$.
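A minimal batch gradient descent sketch on the five toy rows (our own illustration; the feature standardization below is an assumption we add so that a single learning rate works with raw areas in the thousands, and the resulting theta is in the scaled coordinates, not the values quoted above, which were fit on the full 47-house dataset):

    import numpy as np

    X = np.array([[2104, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
    y = np.array([400, 330, 369, 232, 540], dtype=float)

    # Scale features to zero mean / unit variance, then prepend the x0 = 1 column.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Xs = np.column_stack([np.ones(len(y)), Xs])

    theta = np.zeros(Xs.shape[1])
    alpha = 0.1
    for _ in range(500):                      # batch: each step uses all m examples
        theta += alpha * (y - Xs @ theta) @ Xs
    print(theta)                              # converges to the least-squares fit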
If we plot $h_\theta(x)$ as a function of $x$ (area), along with the training data, we obtain the following figure:

[Figure: housing prices; the fitted line $h_\theta(x)$ plotted over the scatter of price (in $1000) versus square feet]

If the number of bedrooms were included as one of the input features as well, we get $\theta_0 = 89.60$, $\theta_1 = 0.1392$, $\theta_2 = -8.738$.

The above results were obtained with batch gradient descent. There is an alternative to batch gradient descent that also works very well. Consider the following algorithm:

Loop {
    for $i = 1$ to $m$, {
        $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$    (for every $j$).
    }
}

In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if $m$ is large, stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets $\theta$ "close" to the minimum much faster than batch gradient descent. (Note however that it may never "converge" to the minimum, and the parameters $\theta$ will keep oscillating around the minimum of $J(\theta)$; but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.²) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

² While it is more common to run stochastic gradient descent as we have described it, with a fixed learning rate $\alpha$, by slowly letting the learning rate $\alpha$ decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around the minimum.
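The inner loop above maps directly to code (a minimal sketch of our own, run here on synthetic data rather than the housing set):

    import numpy as np

    def sgd(X, y, alpha=0.01, epochs=50):
        # Stochastic gradient descent: one update per training example.
        # Assumes X already contains the x0 = 1 intercept column.
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(len(y)):                 # one pass = one "Loop" iteration
                theta += alpha * (y[i] - X[i] @ theta) * X[i]
        return theta

    rng = np.random.default_rng(6)
    X = np.column_stack([np.ones(200), rng.standard_normal(200)])
    y = 4.0 + 2.5 * X[:, 1] + 0.1 * rng.standard_normal(200)
    print(sgd(X, y))                                # approaches [4.0, 2.5]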
2  The normal equations

Gradient descent gives one way of minimizing $J$. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices.

2.1  Matrix derivatives

For a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ mapping from $m$-by-$n$ matrices to the real numbers, we define the derivative of $f$ with respect to $A$ to be:

$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial A_{11}} & \cdots & \frac{\partial f}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m1}} & \cdots & \frac{\partial f}{\partial A_{mn}} \end{bmatrix}$$

Thus, the gradient $\nabla_A f(A)$ is itself an $m$-by-$n$ matrix, whose $(i, j)$-element is $\partial f / \partial A_{ij}$. For example, suppose $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ is a 2-by-2 matrix, and the function $f : \mathbb{R}^{2 \times 2} \to \mathbb{R}$ is given by

$$f(A) = \frac{3}{2} A_{11} + 5 A_{12}^2 + A_{21} A_{22}.$$

Here, $A_{ij}$ denotes the $(i, j)$ entry of the matrix $A$. We then have

$$\nabla_A f(A) = \begin{bmatrix} \frac{3}{2} & 10 A_{12} \\ A_{22} & A_{21} \end{bmatrix}.$$

We also introduce the trace operator, written "tr." For an $n$-by-$n$ (square) matrix $A$, the trace of $A$ is defined to be the sum of its diagonal entries:

$$\mathrm{tr} A = \sum_{i=1}^n A_{ii}$$

If $a$ is a real number (i.e., a 1-by-1 matrix), then $\mathrm{tr}\, a = a$. (If you haven't seen this "operator notation" before, you should think of the trace of $A$ as $\mathrm{tr}(A)$, or as application of the "trace" function to the matrix $A$. It's more commonly written without the parentheses, however.)

The trace operator has the property that for two matrices $A$ and $B$ such that $AB$ is square, we have that $\mathrm{tr} AB = \mathrm{tr} BA$. (Check this yourself!) As corollaries of this, we also have, e.g.,

$$\mathrm{tr} ABC = \mathrm{tr} CAB = \mathrm{tr} BCA, \qquad \mathrm{tr} ABCD = \mathrm{tr} DABC = \mathrm{tr} CDAB = \mathrm{tr} BCDA.$$

The following properties of the trace operator are also easily verified. Here, $A$ and $B$ are square matrices, and $a$ is a real number:

$$\mathrm{tr} A = \mathrm{tr} A^T, \qquad \mathrm{tr}(A + B) = \mathrm{tr} A + \mathrm{tr} B, \qquad \mathrm{tr}\, aA = a \, \mathrm{tr} A$$

We now state without proof some facts of matrix derivatives (we won't need some of these until later this quarter). Equation (4) applies only to non-singular square matrices $A$, where $|A|$ denotes the determinant of $A$. We have:

$$\nabla_A \mathrm{tr} AB = B^T \qquad (1)$$
$$\nabla_{A^T} f(A) = (\nabla_A f(A))^T \qquad (2)$$
$$\nabla_A \mathrm{tr} ABA^T C = CAB + C^T A B^T \qquad (3)$$
$$\nabla_A |A| = |A| (A^{-1})^T. \qquad (4)$$

To make our matrix notation more concrete, let us now explain in detail the meaning of the first of these equations. Suppose we have some fixed matrix $B \in \mathbb{R}^{n \times m}$. We can then define a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ according to $f(A) = \mathrm{tr} AB$. Note that this definition makes sense, because if $A \in \mathbb{R}^{m \times n}$, then $AB$ is a square matrix, and we can apply the trace operator to it; thus, $f$ does indeed map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}$. We can then apply our definition of matrix derivatives to find $\nabla_A f(A)$, which will itself be an $m$-by-$n$ matrix. Equation (1) above states that the $(i, j)$ entry of this matrix will be given by the $(i, j)$-entry of $B^T$, or equivalently, by $B_{ji}$.

The proofs of Equations (1-3) are reasonably simple, and are left as an exercise to the reader. Equation (4) can be derived using the adjoint representation of the inverse of a matrix.³

³ If we define $A'$ to be the matrix whose $(i, j)$ element is $(-1)^{i+j}$ times the determinant of the square matrix resulting from deleting row $i$ and column $j$ from $A$, then it can be proved that $A^{-1} = (A')^T / |A|$. (You can check that this is consistent with the standard way of finding $A^{-1}$ when $A$ is a 2-by-2 matrix. If you want to see a proof of this more general result, see an intermediate or advanced linear algebra text, such as Charles Curtis, 1991, Linear Algebra, Springer.) This shows that $A' = |A| (A^{-1})^T$. Also, the determinant of a matrix can be written $|A| = \sum_j A_{ij} A'_{ij}$. Since $(A')_{ij}$ does not depend on $A_{ij}$ (as can be seen from its definition), this implies that $(\partial / \partial A_{ij}) |A| = A'_{ij}$. Putting all this together shows the result.
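Facts like (1) and (4) are easy to sanity-check numerically (our own finite-difference sketch; none of this machinery appears in the notes):

    import numpy as np

    def num_matrix_grad(f, A, eps=1e-6):
        # Entrywise finite-difference approximation of nabla_A f(A).
        G = np.zeros_like(A)
        for i in range(A.shape[0]):
            for j in range(A.shape[1]):
                E = np.zeros_like(A); E[i, j] = eps
                G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
        return G

    rng = np.random.default_rng(7)
    A = rng.standard_normal((3, 2))
    B = rng.standard_normal((2, 3))
    # Equation (1): nabla_A tr(AB) = B^T
    print(np.allclose(num_matrix_grad(lambda M: np.trace(M @ B), A), B.T))
    # Equation (4): nabla_A |A| = |A| (A^{-1})^T, for a non-singular square A
    S = rng.standard_normal((3, 3))
    print(np.allclose(num_matrix_grad(np.linalg.det, S),
                      np.linalg.det(S) * np.linalg.inv(S).T, atol=1e-4))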
2.2  Least squares revisited

Armed with the tools of matrix derivatives, let us now proceed to find in closed form the value of $\theta$ that minimizes $J(\theta)$. We begin by re-writing $J$ in matrix-vectorial notation.

Given a training set, define the design matrix $X$ to be the $m$-by-$n$ matrix (actually $m$-by-$(n+1)$, if we include the intercept term) that contains the training examples' input values in its rows:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}.$$

Also, let $\vec{y}$ be the $m$-dimensional vector containing all the target values from the training set:

$$\vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}.$$

Now, since $h_\theta(x^{(i)}) = (x^{(i)})^T \theta$, we can easily verify that

$$X\theta - \vec{y} = \begin{bmatrix} (x^{(1)})^T \theta \\ \vdots \\ (x^{(m)})^T \theta \end{bmatrix} - \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} = \begin{bmatrix} h_\theta(x^{(1)}) - y^{(1)} \\ \vdots \\ h_\theta(x^{(m)}) - y^{(m)} \end{bmatrix}.$$

Thus, using the fact that for a vector $z$, we have $z^T z = \sum_i z_i^2$:

$$\frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y}) = \frac{1}{2} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = J(\theta)$$

Finally, to minimize $J$, let's find its derivatives with respect to $\theta$. Combining Equations (2) and (3), we find that

$$\nabla_{A^T} \mathrm{tr} ABA^T C = B^T A^T C^T + B A^T C \qquad (5)$$

Hence,

$$\nabla_\theta J(\theta) = \nabla_\theta \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})$$
$$= \frac{1}{2} \nabla_\theta \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$$
$$= \frac{1}{2} \nabla_\theta \, \mathrm{tr} \left( \theta^T X^T X \theta - \theta^T X^T \vec{y} - \vec{y}^T X \theta + \vec{y}^T \vec{y} \right)$$
$$= \frac{1}{2} \nabla_\theta \left( \mathrm{tr}\, \theta^T X^T X \theta - 2 \,\mathrm{tr}\, \vec{y}^T X \theta \right)$$
$$= \frac{1}{2} \left( X^T X \theta + X^T X \theta - 2 X^T \vec{y} \right)$$
$$= X^T X \theta - X^T \vec{y}$$

In the third step, we used the fact that the trace of a real number is just the real number; the fourth step used the fact that $\mathrm{tr} A = \mathrm{tr} A^T$; and the fifth step used Equation (5) with $A^T = \theta$, $B = B^T = X^T X$, and $C = I$, and Equation (1). To minimize $J$, we set its derivatives to zero, and obtain the normal equations:

$$X^T X \theta = X^T \vec{y}$$

Thus, the value of $\theta$ that minimizes $J(\theta)$ is given in closed form by the equation

$$\theta = (X^T X)^{-1} X^T \vec{y}.$$
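A closed-form sketch on the toy housing rows (our own; with only five rows the numbers are illustrative, not the theta values quoted earlier, which came from the full dataset):

    import numpy as np

    X = np.array([[1, 2104], [1, 1600], [1, 2400], [1, 1416], [1, 3000]], dtype=float)
    y = np.array([400, 330, 369, 232, 540], dtype=float)

    theta = np.linalg.solve(X.T @ X, X.T @ y)     # solve X^T X theta = X^T y
    print(theta)                                  # [theta_0, theta_1]
    print(np.allclose(X.T @ (X @ theta - y), 0))  # gradient X^T X theta - X^T y is zero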
3  Probabilistic interpretation

When faced with a regression problem, why might linear regression, and specifically why might the least-squares cost function $J$, be a reasonable choice? In this section, we will give a set of probabilistic assumptions under which least-squares regression is derived as a very natural algorithm.

Let us assume that the target variables and the inputs are related via the equation

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects (such as if there are some features very pertinent to predicting housing price that we'd left out of the regression) or random noise. Let us further assume that the $\epsilon^{(i)}$ are distributed IID (independently and identically distributed) according to a Gaussian distribution (also called a Normal distribution) with mean zero and some variance $\sigma^2$. We can write this assumption as "$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$." I.e., the density of $\epsilon^{(i)}$ is given by

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right).$$

This implies that

$$p(y^{(i)} | x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).$$

The notation "$p(y^{(i)} | x^{(i)}; \theta)$" indicates that this is the distribution of $y^{(i)}$ given $x^{(i)}$ and parameterized by $\theta$. Note that we should not condition on $\theta$ ("$p(y^{(i)} | x^{(i)}, \theta)$"), since $\theta$ is not a random variable. We can also write the distribution of $y^{(i)}$ as $y^{(i)} | x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$.

Given $X$ (the design matrix, which contains all the $x^{(i)}$'s) and $\theta$, what is the distribution of the $y^{(i)}$'s? The probability of the data is given by $p(\vec{y} | X; \theta)$. This quantity is typically viewed as a function of $\vec{y}$ (and perhaps $X$), for a fixed value of $\theta$. When we wish to explicitly view this as a function of $\theta$, we will instead call it the likelihood function:

$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} | X; \theta).$$

Note that by the independence assumption on the $\epsilon^{(i)}$'s (and hence also the $y^{(i)}$'s given the $x^{(i)}$'s), this can also be written

$$L(\theta) = \prod_{i=1}^m p(y^{(i)} | x^{(i)}; \theta) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right).$$

Now, given this probabilistic model relating the $y^{(i)}$'s and the $x^{(i)}$'s, what is a reasonable way of choosing our best guess of the parameters $\theta$? The principle of maximum likelihood says that we should choose $\theta$ so as to make the data as high probability as possible. I.e., we should choose $\theta$ to maximize $L(\theta)$.

Instead of maximizing $L(\theta)$, we can also maximize any strictly increasing function of $L(\theta)$. In particular, the derivations will be a bit simpler if we instead maximize the log likelihood $\ell(\theta)$:

$$\ell(\theta) = \log L(\theta) = \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$
$$= \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)$$
$$= m \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^m \left( y^{(i)} - \theta^T x^{(i)} \right)^2.$$

Hence, maximizing $\ell(\theta)$ gives the same answer as minimizing

$$\frac{1}{2} \sum_{i=1}^m \left( y^{(i)} - \theta^T x^{(i)} \right)^2,$$

which we recognize to be $J(\theta)$, our original least-squares cost function.

To summarize: under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of $\theta$. This is thus one set of assumptions under which least-squares regression can be justified as a very natural method that's just doing maximum likelihood estimation. (Note however that the probabilistic assumptions are by no means necessary for least-squares to be a perfectly good and rational procedure, and there may be, and indeed there are, other natural assumptions that can also be used to justify it.)

Note also that, in our previous discussion, our final choice of $\theta$ did not depend on what $\sigma^2$ was, and indeed we'd have arrived at the same result even if $\sigma^2$ were unknown. We will use this fact again later, when we talk about the exponential family and generalized linear models.
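The equivalence can be checked numerically (a small sketch of our own, with sigma fixed to 1; as just noted, its value does not affect the argmax):

    import numpy as np

    rng = np.random.default_rng(8)
    X = np.column_stack([np.ones(100), rng.standard_normal(100)])
    y = X @ np.array([1.0, 2.0]) + rng.standard_normal(100)

    def ell(theta):
        # log likelihood l(theta) = m log(1/sqrt(2 pi)) - (1/2) sum_i (y_i - theta^T x_i)^2
        r = y - X @ theta
        return len(y) * np.log(1 / np.sqrt(2 * np.pi)) - 0.5 * r @ r

    theta_ls = np.linalg.solve(X.T @ X, X.T @ y)   # the least-squares minimizer
    # Perturbing theta_ls in any direction can only lower the log likelihood:
    print(all(ell(theta_ls) >= ell(theta_ls + 0.1 * rng.standard_normal(2))
              for _ in range(5)))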
4  Locally weighted linear regression

Consider the problem of predicting $y$ from $x \in \mathbb{R}$. The leftmost figure below shows the result of fitting $y = \theta_0 + \theta_1 x$ to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good.

[Figure: three panels fitting the same dataset; left, a linear fit; middle, a quadratic fit; right, a 5th-order polynomial passing through every data point]

Instead, if we had added an extra feature $x^2$, and fit $y = \theta_0 + \theta_1 x + \theta_2 x^2$, then we obtain a slightly better fit to the data (see the middle figure). Naively, it might seem that the more features we add, the better. However, there is also a danger in adding too many features: the rightmost figure is the result of fitting a 5th-order polynomial $y = \sum_{j=0}^5 \theta_j x^j$. We see that even though the fitted curve passes through the data perfectly, we would not expect this to be a very good predictor of, say, housing prices ($y$) for different living areas ($x$). Without formally defining what these terms mean, we'll say the figure on the left shows an instance of underfitting, in which the data clearly shows structure not captured by the model, and the figure on the right is an example of overfitting. (Later in this class, when we talk about learning theory, we'll formalize some of these notions, and also define more carefully just what it means for a hypothesis to be good or bad.)

As discussed previously, and as shown in the example above, the choice of features is important to ensuring good performance of a learning algorithm. (When we talk about model selection, we'll also see algorithms for automatically choosing a good set of features.) In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. This treatment will be brief, since you'll get a chance to explore some of the properties of the LWR algorithm yourself in the homework.

In the original linear regression algorithm, to make a prediction at a query point $x$ (i.e., to evaluate $h(x)$), we would:

1. Fit $\theta$ to minimize $\sum_i (y^{(i)} - \theta^T x^{(i)})^2$.
2. Output $\theta^T x$.

In contrast, the locally weighted linear regression algorithm does the following:

1. Fit $\theta$ to minimize $\sum_i w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$.
2. Output $\theta^T x$.

Here, the $w^{(i)}$'s are non-negative valued weights. Intuitively, if $w^{(i)}$ is large for a particular value of $i$, then in picking $\theta$, we'll try hard to make $(y^{(i)} - \theta^T x^{(i)})^2$ small. If $w^{(i)}$ is small, then the $(y^{(i)} - \theta^T x^{(i)})^2$ error term will be pretty much ignored in the fit.

A fairly standard choice for the weights is⁴

$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)$$

Note that the weights depend on the particular point $x$ at which we're trying to evaluate $h(x)$. Moreover, if $|x^{(i)} - x|$ is small, then $w^{(i)}$ is close to 1; and if $|x^{(i)} - x|$ is large, then $w^{(i)}$ is small. Hence, $\theta$ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point $x$. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the $w^{(i)}$'s do not directly have anything to do with Gaussians, and in particular the $w^{(i)}$ are not random variables, normally distributed or otherwise.) The parameter $\tau$ controls how quickly the weight of a training example falls off with the distance of its $x^{(i)}$ from the query point $x$; $\tau$ is called the bandwidth parameter, and is also something that you'll get to experiment with in your homework.

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set.

⁴ If $x$ is vector-valued, this is generalized to be $w^{(i)} = \exp(-(x^{(i)} - x)^T (x^{(i)} - x) / (2\tau^2))$, or $w^{(i)} = \exp(-(x^{(i)} - x)^T \Sigma^{-1} (x^{(i)} - x) / 2)$, for an appropriate choice of $\tau$ or $\Sigma$.
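A minimal LWR sketch at a single query point (our own; the weighted normal equations used below, X^T W X theta = X^T W y, are the standard closed-form solution to the weighted objective, though the notes do not spell them out):

    import numpy as np

    def lwr_predict(x_query, X, y, tau=0.5):
        # X is m x 2 with an intercept column; x_query likewise includes the 1.
        d = X[:, 1] - x_query[1]
        w = np.exp(-d**2 / (2 * tau**2))                   # weights fall off with distance
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal equations
        return x_query @ theta

    rng = np.random.default_rng(9)
    xs = np.sort(rng.uniform(0, 7, 60))
    ys = np.sin(xs) + 0.1 * rng.standard_normal(60)
    X = np.column_stack([np.ones_like(xs), xs])
    print(lwr_predict(np.array([1.0, 3.0]), X, ys))        # roughly sin(3.0) ~ 0.141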
+ +5 + +Logistic regression + +We could approach the classification problem ignoring the fact that y is +discrete-valued, and use our old linear regression algorithm to try to predict +y given x. However, it is easy to construct examples where this method +performs very poorly. Intuitively, it also doesn’t make sense for hθ (x) to take +values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. +To fix this, lets change the form for our hypotheses hθ (x). We will choose +hθ (x) = g(θT x) = +where + +1 +, +1 + e−θT x + +1 +1 + e−z +is called the logistic function or the sigmoid function. Here is a plot +showing g(z): +g(z) = + + 17 +1 + +0.9 + +0.8 + +0.7 + +g(z) + +0.6 + +0.5 + +0.4 + +0.3 + +0.2 + +0.1 + +0 +−5 + +−4 + +−3 + +−2 + +−1 + +0 +z + +1 + +2 + +3 + +4 + +5 + +Notice that g(z) tends towards 1 as z → ∞, and g(z) tends towards 0 as +z → −∞. Moreover, g(z), and hence also h(x), is always bounded between +0 and 1. As before, we are keeping the convention of letting x0 = 1, so that +θT x = θ0 + nj=1 θj xj . +For now, lets take the choice of g as given. Other functions that smoothly +increase from 0 to 1 can also be used, but for a couple of reasons that we’ll see +later (when we talk about GLMs, and when we talk about generative learning +algorithms), the choice of the logistic function is a fairly natural one. Before +moving on, here’s a useful property of the derivative of the sigmoid function, +which we write a g ′ : +d +1 +dz 1 + e−z +1 += +e−z +(1 + e−z )2 +1 +1 +· 1− += +−z +(1 + e ) +(1 + e−z ) += g(z)(1 − g(z)). + +g ′ (z) = + +So, given the logistic regression model, how do we fit θ for it? Following how we saw least squares regression could be derived as the maximum +likelihood estimator under a set of assumptions, lets endow our classification +model with a set of probabilistic assumptions, and then fit the parameters +via maximum likelihood. + + 18 +Let us assume that +P (y = 1 | x; θ) = hθ (x) +P (y = 0 | x; θ) = 1 − hθ (x) +Note that this can be written more compactly as +p(y | x; θ) = (hθ (x))y (1 − hθ (x))1−y +Assuming that the m training examples were generated independently, we +can then write down the likelihood of the parameters as +L(θ) = p(y | X; θ) +m + += +i=1 +m + +p(y (i) | x(i) ; θ) +hθ (x(i) ) + += +i=1 + +y (i) + +1 − hθ (x(i) ) + +1−y (i) + +As before, it will be easier to maximize the log likelihood: +ℓ(θ) = log L(θ) +m + += +i=1 + +y (i) log h(x(i) ) + (1 − y (i) ) log(1 − h(x(i) )) + +How do we maximize the likelihood? Similar to our derivation in the case +of linear regression, we can use gradient ascent. Written in vectorial notation, +our updates will therefore be given by θ := θ + α∇θ ℓ(θ). (Note the positive +rather than negative sign in the update formula, since we’re maximizing, +rather than minimizing, a function now.) Lets start by working with just +one training example (x, y), and take derivatives to derive the stochastic +gradient ascent rule: +∂ +ℓ(θ) = +∂θj + +∂ +1 +1 +− (1 − y) +g(θT x) +T +T +g(θ x) +1 − g(θ x) ∂θj +1 +1 +∂ T += +y +− (1 − y) +θ x +g(θT x)(1 − g(θT x) +T +T +g(θ x) +1 − g(θ x) +∂θj += y(1 − g(θT x)) − (1 − y)g(θT x) xj += (y − hθ (x)) xj +y + + 19 +Above, we used the fact that g ′ (z) = g(z)(1 − g(z)). This therefore gives us +the stochastic gradient ascent rule +(i) + +θj := θj + α y (i) − hθ (x(i) ) xj + +If we compare this to the LMS update rule, we see that it looks identical; but +this is not the same algorithm, because hθ (x(i) ) is now defined as a non-linear +function of θT x(i) . 
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this a coincidence, or is there a deeper reason behind this? We'll answer this when we get to GLM models. (See also the extra credit problem on Q3 of problem set 1.)

6  Digression: The perceptron learning algorithm

We now digress to talk briefly about an algorithm that's of some historical interest, and that we will also return to later when we talk about learning theory. Consider modifying the logistic regression method to "force" it to output values that are exactly 0 or 1. To do so, it seems natural to change the definition of $g$ to be the threshold function:

$$g(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{if } z < 0 \end{cases}$$

If we then let $h_\theta(x) = g(\theta^T x)$ as before but using this modified definition of $g$, and if we use the update rule

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)},$$

then we have the perceptron learning algorithm.

In the 1960s, this "perceptron" was argued to be a rough model for how individual neurons in the brain work. Given how simple the algorithm is, it will also provide a starting point for our analysis when we talk about learning theory later in this class. Note however that even though the perceptron may be cosmetically similar to the other algorithms we talked about, it is actually a very different type of algorithm than logistic regression and least squares linear regression; in particular, it is difficult to endow the perceptron's predictions with meaningful probabilistic interpretations, or to derive the perceptron as a maximum likelihood estimation algorithm.
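A minimal perceptron sketch (our own; labels are 0/1 as in the text, and we keep only examples with a margin from the separating line so training settles quickly):

    import numpy as np

    def perceptron(X, y, alpha=1.0, epochs=50):
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for i in range(len(y)):
                h = 1.0 if X[i] @ theta >= 0 else 0.0   # threshold function g
                theta += alpha * (y[i] - h) * X[i]      # updates only on mistakes
        return theta

    rng = np.random.default_rng(11)
    X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])
    y = (X[:, 1] + 2 * X[:, 2] - 0.3 >= 0).astype(float)    # linearly separable labels
    mask = np.abs(X[:, 1] + 2 * X[:, 2] - 0.3) > 0.5        # enforce a margin
    X, y = X[mask], y[mask]
    theta = perceptron(X, y)
    print(np.all((X @ theta >= 0).astype(float) == y))      # perfect on separable data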
7  Another algorithm for maximizing ℓ(θ)

Returning to logistic regression with $g(z)$ being the sigmoid function, let's now talk about a different algorithm for maximizing $\ell(\theta)$.

To get us started, let's consider Newton's method for finding a zero of a function. Specifically, suppose we have some function $f : \mathbb{R} \to \mathbb{R}$, and we wish to find a value of $\theta$ so that $f(\theta) = 0$. Here, $\theta \in \mathbb{R}$ is a real number. Newton's method performs the following update:

$$\theta := \theta - \frac{f(\theta)}{f'(\theta)}.$$

This method has a natural interpretation in which we can think of it as approximating the function $f$ via a linear function that is tangent to $f$ at the current guess $\theta$, solving for where that linear function equals zero, and letting the next guess for $\theta$ be where that linear function is zero.

Here's a picture of Newton's method in action:

[Figure: three panels showing f(x) with successive tangent-line approximations and their zero crossings]

In the leftmost figure, we see the function $f$ plotted along with the line $y = 0$. We're trying to find $\theta$ so that $f(\theta) = 0$; the value of $\theta$ that achieves this is about 1.3. Suppose we initialized the algorithm with $\theta = 4.5$. Newton's method then fits a straight line tangent to $f$ at $\theta = 4.5$, and solves for where that line evaluates to 0 (middle figure). This gives us the next guess for $\theta$, which is about 2.8. The rightmost figure shows the result of running one more iteration, which updates $\theta$ to about 1.8. After a few more iterations, we rapidly approach $\theta = 1.3$.

Newton's method gives a way of getting to $f(\theta) = 0$. What if we want to use it to maximize some function $\ell$? The maxima of $\ell$ correspond to points where its first derivative $\ell'(\theta)$ is zero. So, by letting $f(\theta) = \ell'(\theta)$, we can use the same algorithm to maximize $\ell$, and we obtain the update rule:

$$\theta := \theta - \frac{\ell'(\theta)}{\ell''(\theta)}.$$

(Something to think about: how would this change if we wanted to use Newton's method to minimize rather than maximize a function?)

Lastly, in our logistic regression setting, $\theta$ is vector-valued, so we need to generalize Newton's method to this setting. The generalization of Newton's method to this multidimensional setting (also called the Newton-Raphson method) is given by

$$\theta := \theta - H^{-1} \nabla_\theta \ell(\theta).$$

Here, $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect to the $\theta_i$'s; and $H$ is an $n$-by-$n$ matrix (actually, $(n+1)$-by-$(n+1)$, assuming that we include the intercept term) called the Hessian, whose entries are given by

$$H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \partial \theta_j}.$$

Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$-by-$n$ Hessian; but so long as $n$ is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function $\ell(\theta)$, the resulting method is also called Fisher scoring.
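A Newton-Raphson sketch for the logistic log likelihood (our own; the gradient and Hessian formulas in the comments are standard results for this model but are not derived in the text):

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    def newton_logistic(X, y, iters=10):
        #   grad l = X^T (y - g(X theta)),   H = -X^T diag(h(1-h)) X
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            h = g(X @ theta)
            grad = X.T @ (y - h)
            H = -(X.T * (h * (1 - h))) @ X
            theta -= np.linalg.solve(H, grad)       # theta := theta - H^{-1} grad
        return theta

    rng = np.random.default_rng(12)
    X = np.column_stack([np.ones(400), rng.standard_normal((400, 2))])
    y = (rng.uniform(size=400) < g(X @ np.array([0.3, 1.5, -2.0]))).astype(float)
    print(newton_logistic(X, y))    # converges in a handful of iterations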
Part III
Generalized Linear Models⁵

So far, we've seen a regression example and a classification example. In the regression example, we had $y | x; \theta \sim \mathcal{N}(\mu, \sigma^2)$, and in the classification one, $y | x; \theta \sim \mathrm{Bernoulli}(\phi)$, for some appropriate definitions of $\mu$ and $\phi$ as functions of $x$ and $\theta$. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.

8  The exponential family

To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

$$p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta)) \qquad (6)$$

Here, $\eta$ is called the natural parameter (also called the canonical parameter) of the distribution; $T(y)$ is the sufficient statistic (for the distributions we consider, it will often be the case that $T(y) = y$); and $a(\eta)$ is the log partition function. The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant, making sure the distribution $p(y; \eta)$ sums/integrates over $y$ to 1.

A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family.

We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean $\phi$, written $\mathrm{Bernoulli}(\phi)$, specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \phi) = \phi$; $p(y = 0; \phi) = 1 - \phi$. As we vary $\phi$, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, ones obtained by varying $\phi$, is in the exponential family; i.e., that there is a choice of $T$, $a$ and $b$ so that Equation (6) becomes exactly the class of Bernoulli distributions.

We write the Bernoulli distribution as:

$$p(y; \phi) = \phi^y (1 - \phi)^{1-y} = \exp(y \log \phi + (1 - y) \log(1 - \phi)) = \exp\left( \left( \log \frac{\phi}{1 - \phi} \right) y + \log(1 - \phi) \right).$$

Thus, the natural parameter is given by $\eta = \log(\phi / (1 - \phi))$. Interestingly, if we invert this definition for $\eta$ by solving for $\phi$ in terms of $\eta$, we obtain $\phi = 1/(1 + e^{-\eta})$. This is the familiar sigmoid function! This will come up again when we derive logistic regression as a GLM. To complete the formulation of the Bernoulli distribution as an exponential family distribution, we also have

$$T(y) = y$$
$$a(\eta) = -\log(1 - \phi) = \log(1 + e^\eta)$$
$$b(y) = 1$$

This shows that the Bernoulli distribution can be written in the form of Equation (6), using an appropriate choice of $T$, $a$ and $b$.

Let's now move on to consider the Gaussian distribution. Recall that, when deriving linear regression, the value of $\sigma^2$ had no effect on our final choice of $\theta$ and $h_\theta(x)$. Thus, we can choose an arbitrary value for $\sigma^2$ without changing anything. To simplify the derivation below, let's set $\sigma^2 = 1$.⁶ We then have:

$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (y - \mu)^2 \right) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} y^2 \right) \cdot \exp\left( \mu y - \frac{1}{2} \mu^2 \right)$$

Thus, we see that the Gaussian is in the exponential family, with

$$\eta = \mu$$
$$T(y) = y$$
$$a(\eta) = \mu^2 / 2 = \eta^2 / 2$$
$$b(y) = (1/\sqrt{2\pi}) \exp(-y^2 / 2).$$

There are many other distributions that are members of the exponential family: the multinomial (which we'll see later), the Poisson (for modelling count data; also see the problem set), the gamma and the exponential (for modelling continuous, non-negative random variables, such as time intervals), the beta and the Dirichlet (for distributions over probabilities), and many more. In the next section, we will describe a general "recipe" for constructing models in which $y$ (given $x$ and $\theta$) comes from any of these distributions.

⁵ The presentation of the material in this section takes inspiration from Michael I. Jordan, Learning in graphical models (unpublished book draft), and also McCullagh and Nelder, Generalized Linear Models (2nd ed.).

⁶ If we leave $\sigma^2$ as a variable, the Gaussian distribution can also be shown to be in the exponential family, where $\eta \in \mathbb{R}^2$ is now a 2-dimensional vector that depends on both $\mu$ and $\sigma$. For the purposes of GLMs, however, the $\sigma^2$ parameter can also be treated by considering a more general definition of the exponential family: $p(y; \eta, \tau) = b(y, \tau) \exp((\eta^T T(y) - a(\eta)) / c(\tau))$. Here, $\tau$ is called the dispersion parameter, and for the Gaussian, $c(\tau) = \sigma^2$; but given our simplification above, we won't need the more general definition for the examples we will consider here.
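The Bernoulli derivation above can be verified numerically in a few lines (our own sanity-check sketch):

    import numpy as np

    def p_exp_family(y, phi):
        # Equation (6) with the Bernoulli's choices of eta, T, a, b.
        eta = np.log(phi / (1 - phi))       # natural parameter
        a = np.log(1 + np.exp(eta))         # log partition function
        b = 1.0
        return b * np.exp(eta * y - a)      # T(y) = y

    for phi in (0.2, 0.5, 0.9):
        for y in (0, 1):
            direct = phi**y * (1 - phi)**(1 - y)
            print(np.isclose(p_exp_family(y, phi), direct))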
9 Constructing GLMs

Suppose you would like to build a model to estimate the number y of customers arriving in your store (or the number of page-views on your website) in any given hour, based on certain features x such as store promotions, recent advertising, weather, day-of-week, etc. We know that the Poisson distribution usually gives a good model for numbers of visitors. Knowing this, how can we come up with a model for our problem? Fortunately, the Poisson is an exponential family distribution, so we can apply a Generalized Linear Model (GLM). In this section, we will describe a method for constructing GLM models for problems such as these.

More generally, consider a classification or regression problem where we would like to predict the value of some random variable y as a function of x. To derive a GLM for this problem, we will make the following three assumptions about the conditional distribution of y given x and about our model:

1. y | x; θ ∼ ExponentialFamily(η). I.e., given x and θ, the distribution of y follows some exponential family distribution, with parameter η.

2. Given x, our goal is to predict the expected value of T(y) given x. In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y|x]. (Note that this assumption is satisfied in the choices for hθ(x) for both logistic regression and linear regression. For instance, in logistic regression, we had hθ(x) = p(y = 1|x; θ) = 0 · p(y = 0|x; θ) + 1 · p(y = 1|x; θ) = E[y|x; θ].)

3. The natural parameter η and the inputs x are related linearly: η = θᵀx. (Or, if η is vector-valued, then ηi = θiᵀx.)

The third of these assumptions might seem the least well justified of the above, and it might be better thought of as a "design choice" in our recipe for designing GLMs, rather than as an assumption per se. These three assumptions/design choices will allow us to derive a very elegant class of learning algorithms, namely GLMs, that have many desirable properties such as ease of learning. Furthermore, the resulting models are often very effective for modelling different types of distributions over y; for example, we will shortly show that both logistic regression and ordinary least squares can be derived as GLMs.

9.1 Ordinary Least Squares

To show that ordinary least squares is a special case of the GLM family of models, consider the setting where the target variable y (also called the response variable in GLM terminology) is continuous, and we model the conditional distribution of y given x as a Gaussian N(µ, σ²). (Here, µ may depend on x.) So, we let the ExponentialFamily(η) distribution above be the Gaussian distribution. As we saw previously, in the formulation of the Gaussian as an exponential family distribution, we had µ = η. So, we have

hθ(x) = E[y|x; θ] = µ = η = θᵀx.

The first equality follows from Assumption 2, above; the second equality follows from the fact that y|x; θ ∼ N(µ, σ²), and so its expected value is given by µ; the third equality follows from Assumption 1 (and our earlier derivation showing that µ = η in the formulation of the Gaussian as an exponential family distribution); and the last equality follows from Assumption 3.

9.2 Logistic Regression

We now consider logistic regression. Here we are interested in binary classification, so y ∈ {0, 1}. Given that y is binary-valued, it therefore seems natural to choose the Bernoulli family of distributions to model the conditional distribution of y given x. In our formulation of the Bernoulli distribution as an exponential family distribution, we had φ = 1/(1 + e^{−η}). Furthermore, note that if y|x; θ ∼ Bernoulli(φ), then E[y|x; θ] = φ. So, following a similar derivation as the one for ordinary least squares, we get:

hθ(x) = E[y|x; θ] = φ = 1/(1 + e^{−η}) = 1/(1 + e^{−θᵀx})

So, this gives us hypothesis functions of the form hθ(x) = 1/(1 + e^{−θᵀx}). If you were previously wondering how we came up with the form of the logistic function 1/(1 + e^{−z}), this gives one answer: once we assume that y conditioned on x is Bernoulli, it arises as a consequence of the definition of GLMs and exponential family distributions.
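The two GLM hypotheses derived so far are one-liners in Octave; θ and x below are hypothetical column vectors (x including the intercept term):

    % GLM hypotheses under the canonical response functions derived above.
    h_linear   = @(theta, x) theta' * x;                    % Gaussian case: h(x) = theta'x
    h_logistic = @(theta, x) 1 / (1 + exp(-(theta' * x)));  % Bernoulli case: sigmoid(theta'x)

    theta = [0.5; -1.2]; x = [1; 2.0];    % arbitrary example values
    printf("%.4f  %.4f\n", h_linear(theta, x), h_logistic(theta, x));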
To introduce a little more terminology, the function g giving the distribution's mean as a function of the natural parameter (g(η) = E[T(y); η]) is called the canonical response function. Its inverse, g⁻¹, is called the canonical link function. Thus, the canonical response function for the Gaussian family is just the identity function; and the canonical response function for the Bernoulli is the logistic function.⁷

⁷ Many texts use g to denote the link function, and g⁻¹ to denote the response function; but the notation we're using here, inherited from the early machine learning literature, will be more consistent with the notation used in the rest of the class.

9.3 Softmax Regression

Let's look at one more example of a GLM. Consider a classification problem in which the response variable y can take on any one of k values, so y ∈ {1, 2, . . . , k}. For example, rather than classifying email into the two classes spam or not-spam—which would have been a binary classification problem—we might want to classify it into three classes, such as spam, personal mail, and work-related mail. The response variable is still discrete, but can now take on more than two values. We will thus model it as distributed according to a multinomial distribution.

Let's derive a GLM for modelling this type of multinomial data. To do so, we will begin by expressing the multinomial as an exponential family distribution.

To parameterize a multinomial over k possible outcomes, one could use k parameters φ1, . . . , φk specifying the probability of each of the outcomes. However, these parameters would be redundant, or more formally, they would not be independent (since knowing any k − 1 of the φi's uniquely determines the last one, as they must satisfy Σ_{i=1}^k φi = 1). So, we will instead parameterize the multinomial with only k − 1 parameters, φ1, . . . , φ_{k−1}, where φi = p(y = i; φ), and p(y = k; φ) = 1 − Σ_{i=1}^{k−1} φi. For notational convenience, we will also let φk = 1 − Σ_{i=1}^{k−1} φi, but we should keep in mind that this is not a parameter, and that it is fully specified by φ1, . . . , φ_{k−1}.

To express the multinomial as an exponential family distribution, we will define T(y) ∈ R^{k−1} as follows:

T(1) = [1; 0; 0; . . . ; 0],  T(2) = [0; 1; 0; . . . ; 0],  T(3) = [0; 0; 1; . . . ; 0],  . . . ,  T(k−1) = [0; 0; 0; . . . ; 1],  T(k) = [0; 0; 0; . . . ; 0].

Unlike our previous examples, here we do not have T(y) = y; also, T(y) is now a (k − 1)-dimensional vector, rather than a real number. We will write (T(y))i to denote the i-th element of the vector T(y).

We introduce one more very useful piece of notation. An indicator function 1{·} takes on a value of 1 if its argument is true, and 0 otherwise (1{True} = 1, 1{False} = 0). For example, 1{2 = 3} = 0, and 1{3 = 5 − 2} = 1. So, we can also write the relationship between T(y) and y as (T(y))i = 1{y = i}. (Before you continue reading, please make sure you understand why this is true!) Further, we have that E[(T(y))i] = P(y = i) = φi.

We are now ready to show that the multinomial is a member of the exponential family.
We have:

p(y; φ) = φ1^{1{y=1}} φ2^{1{y=2}} · · · φk^{1{y=k}}
        = φ1^{1{y=1}} φ2^{1{y=2}} · · · φk^{1 − Σ_{i=1}^{k−1} 1{y=i}}
        = φ1^{(T(y))1} φ2^{(T(y))2} · · · φk^{1 − Σ_{i=1}^{k−1} (T(y))i}
        = exp( (T(y))1 log(φ1) + (T(y))2 log(φ2) + · · · + (1 − Σ_{i=1}^{k−1} (T(y))i) log(φk) )
        = exp( (T(y))1 log(φ1/φk) + (T(y))2 log(φ2/φk) + · · · + (T(y))_{k−1} log(φ_{k−1}/φk) + log(φk) )
        = b(y) exp(ηᵀT(y) − a(η))

where

η = [log(φ1/φk); log(φ2/φk); . . . ; log(φ_{k−1}/φk)],
a(η) = −log(φk),
b(y) = 1.

This completes our formulation of the multinomial as an exponential family distribution.

The link function is given (for i = 1, . . . , k) by

ηi = log(φi/φk).

For convenience, we have also defined ηk = log(φk/φk) = 0. To invert the link function and derive the response function, we therefore have that

e^{ηi} = φi/φk
φk e^{ηi} = φi    (7)
φk Σ_{i=1}^k e^{ηi} = Σ_{i=1}^k φi = 1

This implies that φk = 1/Σ_{i=1}^k e^{ηi}, which can be substituted back into Equation (7) to give the response function

φi = e^{ηi} / Σ_{j=1}^k e^{ηj}

This function mapping from the η's to the φ's is called the softmax function.

To complete our model, we use Assumption 3, given earlier, that the ηi's are linearly related to the x's. So, we have ηi = θiᵀx (for i = 1, . . . , k − 1), where θ1, . . . , θ_{k−1} ∈ R^{n+1} are the parameters of our model. For notational convenience, we can also define θk = 0, so that ηk = θkᵀx = 0, as given previously. Hence, our model assumes that the conditional distribution of y given x is given by

p(y = i|x; θ) = φi = e^{ηi} / Σ_{j=1}^k e^{ηj} = e^{θiᵀx} / Σ_{j=1}^k e^{θjᵀx}    (8)

This model, which applies to classification problems where y ∈ {1, . . . , k}, is called softmax regression. It is a generalization of logistic regression.

Our hypothesis will output

hθ(x) = E[T(y)|x; θ]
      = E[ [1{y = 1}; 1{y = 2}; . . . ; 1{y = k − 1}] | x; θ ]
      = [φ1; φ2; . . . ; φ_{k−1}]
      = [ e^{θ1ᵀx}/Σ_{j=1}^k e^{θjᵀx}; e^{θ2ᵀx}/Σ_{j=1}^k e^{θjᵀx}; . . . ; e^{θ_{k−1}ᵀx}/Σ_{j=1}^k e^{θjᵀx} ].

In other words, our hypothesis will output the estimated probability p(y = i|x; θ) for every value of i = 1, . . . , k. (Even though hθ(x) as defined above is only (k − 1)-dimensional, clearly p(y = k|x; θ) can be obtained as 1 − Σ_{i=1}^{k−1} φi.)

Lastly, let's discuss parameter fitting. Similar to our original derivation of ordinary least squares and logistic regression, if we have a training set of m examples {(x(i), y(i)); i = 1, . . . , m} and would like to learn the parameters θi of this model, we would begin by writing down the log-likelihood

ℓ(θ) = Σ_{i=1}^m log p(y(i)|x(i); θ)
     = Σ_{i=1}^m log Π_{l=1}^k ( e^{θlᵀx(i)} / Σ_{j=1}^k e^{θjᵀx(i)} )^{1{y(i)=l}}

To obtain the second line above, we used the definition for p(y|x; θ) given in Equation (8). We can now obtain the maximum likelihood estimate of the parameters by maximizing ℓ(θ) in terms of θ, using a method such as gradient ascent or Newton's method.
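A minimal Octave sketch of the softmax hypothesis in Equation (8); Theta (an (n+1) x k matrix whose columns are θ1, . . . , θk, with the last column fixed at 0) and x are illustrative assumptions, and the max-subtraction is a standard numerical-stability trick not discussed in the notes:

    % Softmax hypothesis: the k estimated probabilities p(y = i | x; theta).
    function phi = softmax_hypothesis(Theta, x)
      eta = Theta' * x;                % eta_i = theta_i' x, for i = 1..k
      eta = eta - max(eta);            % subtract the max for numerical stability
      phi = exp(eta) / sum(exp(eta));  % phi_i = e^{eta_i} / sum_j e^{eta_j}
    end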
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes10.txt b/Lectures/aimlcs229/cs229-notes10.txt
new file mode 100644
index 0000000..517095b
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes10.txt
@@ -0,0 +1,920 @@
CS229 Lecture notes
Andrew Ng

Part XI

Principal components analysis

In our discussion of factor analysis, we gave a way to model data x ∈ Rⁿ as "approximately" lying in some k-dimensional subspace, where k ≪ n. Specifically, we imagined that each point x(i) was created by first generating some z(i) lying in the k-dimensional affine space {Λz + µ; z ∈ Rᵏ}, and then adding Ψ-covariance noise. Factor analysis is based on a probabilistic model, and parameter estimation used the iterative EM algorithm.

In this set of notes, we will develop a method, Principal Components Analysis (PCA), that also tries to identify the subspace in which the data approximately lies. However, PCA will do so more directly, and will require only an eigenvector calculation (easily done with the eig function in Matlab), and does not need to resort to EM.

Suppose we are given a dataset {x(i); i = 1, . . . , m} of attributes of m different types of automobiles, such as their maximum speed, turn radius, and so on. Let x(i) ∈ Rⁿ for each i (n ≪ m). But unknown to us, two different attributes—some xi and xj—respectively give a car's maximum speed measured in miles per hour, and the maximum speed measured in kilometers per hour. These two attributes are therefore almost linearly dependent, up to only small differences introduced by rounding off to the nearest mph or kph. Thus, the data really lies approximately on an (n − 1)-dimensional subspace. How can we automatically detect, and perhaps remove, this redundancy?

For a less contrived example, consider a dataset resulting from a survey of pilots for radio-controlled helicopters, where x1(i) is a measure of the piloting skill of pilot i, and x2(i) captures how much he/she enjoys flying. Because RC helicopters are very difficult to fly, only the most committed students, ones that truly enjoy flying, become good pilots. So, the two attributes x1 and x2 are strongly correlated. Indeed, we might posit that the data actually lies along some diagonal axis (the u1 direction) capturing the intrinsic piloting "karma" of a person, with only a small amount of noise lying off this axis. (See figure.) How can we automatically compute this u1 direction?

[Figure: scatter of x1 (skill) vs. x2 (enjoyment), with a diagonal direction u1 through the data and an orthogonal direction u2.]

We will shortly develop the PCA algorithm. But prior to running PCA per se, typically we first pre-process the data to normalize its mean and variance, as follows:

1. Let µ = (1/m) Σ_{i=1}^m x(i).
2. Replace each x(i) with x(i) − µ.
3. Let σj² = (1/m) Σ_i (xj(i))².
4. Replace each xj(i) with xj(i)/σj.

Steps (1-2) zero out the mean of the data, and may be omitted for data known to have zero mean (for instance, time series corresponding to speech or other acoustic signals). Steps (3-4) rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same "scale." For instance, if x1 was cars' maximum speed in mph (taking values in the high tens or low hundreds) and x2 were the number of seats (taking values around 2-4), then this renormalization rescales the different attributes to make them more comparable. Steps (3-4) may be omitted if we had a priori knowledge that the different attributes are all on the same scale. One example of this is if each data point represented a grayscale image, and each xj(i) took a value in {0, 1, . . . , 255} corresponding to the intensity value of pixel j in image i.
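A short Octave sketch of this normalization, assuming the data is stored row-wise in an m x n matrix X (that layout, and the variable names, are illustrative choices):

    % Normalize each column of X to zero mean and unit variance (steps 1-4).
    mu = mean(X, 1);                       % step 1: per-feature means (1 x n)
    Xc = X - repmat(mu, rows(X), 1);       % step 2: subtract the mean from every row
    sigma = sqrt(mean(Xc .^ 2, 1));        % step 3: per-feature std (1/m convention)
    Xn = Xc ./ repmat(sigma, rows(X), 1);  % step 4: rescale to unit variance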
Now, having carried out the normalization, how do we compute the "major axis of variation" u—that is, the direction on which the data approximately lies? One way to pose this problem is as finding the unit vector u so that when the data is projected onto the direction corresponding to u, the variance of the projected data is maximized. Intuitively, the data starts off with some amount of variance/information in it. We would like to choose a direction u so that if we were to approximate the data as lying in the direction/subspace corresponding to u, as much as possible of this variance is still retained.

Consider the following dataset, on which we have already carried out the normalization steps:

[Figure: a normalized 2D dataset, plotted as crosses.]

Now, suppose we pick u to correspond to the direction shown in the figure below. The circles denote the projections of the original data onto this line.

[Figure: the data projected onto a direction u roughly along the major axis of the data; the projected points have large variance.]

We see that the projected data still has a fairly large variance, and the points tend to be far from zero. In contrast, suppose we had instead picked the following direction:
[Figure: the data projected onto a direction roughly orthogonal to the previous one; the projected points have much smaller variance.]

Here, the projections have a significantly smaller variance, and are much closer to the origin.

We would like to automatically select the direction u corresponding to the first of the two figures shown above. To formalize this, note that given a unit vector u and a point x, the length of the projection of x onto u is given by xᵀu. I.e., if x(i) is a point in our dataset (one of the crosses in the plot), then its projection onto u (the corresponding circle in the figure) is distance xᵀu from the origin. Hence, to maximize the variance of the projections, we would like to choose a unit-length u so as to maximize:

(1/m) Σ_{i=1}^m (x(i)ᵀu)² = (1/m) Σ_{i=1}^m uᵀx(i)x(i)ᵀu
                          = uᵀ ( (1/m) Σ_{i=1}^m x(i)x(i)ᵀ ) u.

We easily recognize that maximizing this subject to ||u||₂ = 1 gives the principal eigenvector of Σ = (1/m) Σ_{i=1}^m x(i)x(i)ᵀ, which is just the empirical covariance matrix of the data (assuming it has zero mean).¹

¹ If you haven't seen this before, try using the method of Lagrange multipliers to maximize uᵀΣu subject to uᵀu = 1. You should be able to show that Σu = λu, for some λ, which implies u is an eigenvector of Σ, with eigenvalue λ.

To summarize, we have found that if we wish to find a 1-dimensional subspace with which to approximate the data, we should choose u to be the principal eigenvector of Σ. More generally, if we wish to project our data into a k-dimensional subspace (k < n), we should choose u1, . . . , uk to be the top k eigenvectors of Σ. The ui's now form a new, orthogonal basis for the data.²

² Because Σ is symmetric, the ui's will (or always can be chosen to be) orthogonal to each other.

Then, to represent x(i) in this basis, we need only compute the corresponding vector

y(i) = [u1ᵀx(i); u2ᵀx(i); . . . ; ukᵀx(i)] ∈ Rᵏ.

Thus, whereas x(i) ∈ Rⁿ, the vector y(i) now gives a lower-dimensional (k-dimensional) approximation/representation for x(i). PCA is therefore also referred to as a dimensionality reduction algorithm. The vectors u1, . . . , uk are called the first k principal components of the data.

Remark. Although we have shown it formally only for the case of k = 1, using well-known properties of eigenvectors it is straightforward to show that of all possible orthogonal bases u1, . . . , uk, the one that we have chosen maximizes Σ_i ||y(i)||₂². Thus, our choice of a basis preserves as much variability as possible in the original data.

In problem set 4, you will see that PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by them.
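The whole procedure fits in a few lines of Octave; here X (m x n, already normalized as above) and k are hypothetical inputs:

    % PCA: project normalized data X (m x n) onto its top k principal components.
    Sigma = (X' * X) / rows(X);     % empirical covariance (data has zero mean)
    [U, D] = eig(Sigma);            % columns of U are eigenvectors of Sigma
    [~, order] = sort(diag(D), 'descend');
    Uk = U(:, order(1:k));          % top k eigenvectors u_1, ..., u_k
    Y = X * Uk;                     % row i of Y is y(i) = [u_1'x(i), ..., u_k'x(i)]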
PCA has many applications; we will close our discussion with a small number of examples. First, compression—representing x(i)'s with lower-dimensional y(i)'s—is an obvious application. If we reduce high-dimensional data to k = 2 or 3 dimensions, then we can also plot the y(i)'s to visualize the data. For instance, if we were to reduce our automobiles data to 2 dimensions, then we can plot it (one point in our plot would correspond to one car type, say) to see what cars are similar to each other and what groups of cars may cluster together.

Another standard application is to preprocess a dataset to reduce its dimension before running a supervised learning algorithm with the x(i)'s as inputs. Apart from computational benefits, reducing the data's dimension can also reduce the complexity of the hypothesis class considered and help avoid overfitting (e.g., linear classifiers over lower-dimensional input spaces will have smaller VC dimension).

Lastly, as in our RC pilot example, we can also view PCA as a noise reduction algorithm. In our example, it estimates the intrinsic "piloting karma" from the noisy measures of piloting skill and enjoyment. In class, we also saw the application of this idea to face images, resulting in the eigenfaces method. Here, each point x(i) ∈ R^{100×100} was a 10000-dimensional vector, with each coordinate corresponding to a pixel intensity value in a 100x100 image of a face. Using PCA, we represent each image x(i) with a much lower-dimensional y(i). In doing so, we hope that the principal components we found retain the interesting, systematic variations between faces that capture what a person really looks like, but not the "noise" in the images introduced by minor lighting variations, slightly different imaging conditions, and so on. We then measure distances between faces i and j by working in the reduced dimension, and computing ||y(i) − y(j)||₂. This resulted in a surprisingly good face-matching and retrieval algorithm.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes11.txt b/Lectures/aimlcs229/cs229-notes11.txt
new file mode 100644
index 0000000..3dbcd76
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes11.txt
@@ -0,0 +1,269 @@
CS229 Lecture notes
Andrew Ng

Part XII

Independent Components Analysis

Our next topic is Independent Components Analysis (ICA). Similar to PCA, this will find a new basis in which to represent our data. However, the goal is very different.

As a motivating example, consider the "cocktail party problem." Here, n speakers are speaking simultaneously at a party, and any microphone placed in the room records only an overlapping combination of the n speakers' voices. But let's say we have n different microphones placed in the room, and because each microphone is a different distance from each of the speakers, it records a different combination of the speakers' voices. Using these microphone recordings, can we separate out the original n speakers' speech signals?

To formalize this problem, we imagine that there is some data s ∈ Rⁿ that is generated via n independent sources. What we observe is

x = As,

where A is an unknown square matrix called the mixing matrix. Repeated observations give us a dataset {x(i); i = 1, . . . , m}, and our goal is to recover the sources s(i) that had generated our data (x(i) = As(i)).

In our cocktail party problem, s(i) is an n-dimensional vector, and sj(i) is the sound that speaker j was uttering at time i.
Also, x(i) is an n-dimensional vector, and xj(i) is the acoustic reading recorded by microphone j at time i.

Let W = A⁻¹ be the unmixing matrix. Our goal is to find W, so that given our microphone recordings x(i), we can recover the sources by computing s(i) = W x(i). For notational convenience, we also let wiᵀ denote the i-th row of W, so that

W = [— w1ᵀ —; . . . ; — wnᵀ —].

Thus, wi ∈ Rⁿ, and the j-th source can be recovered by computing sj(i) = wjᵀx(i).

1 ICA ambiguities

To what degree can W = A⁻¹ be recovered? If we have no prior knowledge about the sources and the mixing matrix, it is not hard to see that there are some inherent ambiguities in A that are impossible to recover, given only the x(i)'s.

Specifically, let P be any n-by-n permutation matrix. This means that each row and each column of P has exactly one "1." Here are some examples of permutation matrices:

P = [0 1 0; 1 0 0; 0 0 1];  P = [0 1; 1 0];  P = [1 0; 0 1].

If z is a vector, then Pz is another vector that contains a permuted version of z's coordinates. Given only the x(i)'s, there will be no way to distinguish between W and PW. Specifically, the permutation of the original sources is ambiguous, which should be no surprise. Fortunately, this does not matter for most applications.

Further, there is no way to recover the correct scaling of the wi's. For instance, if A were replaced with 2A, and every s(i) were replaced with (0.5)s(i), then our observed x(i) = 2A · (0.5)s(i) would still be the same. More broadly, if a single column of A were scaled by a factor of α, and the corresponding source were scaled by a factor of 1/α, then there is again no way, given only the x(i)'s, to determine that this had happened. Thus, we cannot recover the "correct" scaling of the sources. However, for the applications that we are concerned with—including the cocktail party problem—this ambiguity also does not matter. Specifically, scaling a speaker's speech signal sj(i) by some positive factor α affects only the volume of that speaker's speech. Also, sign changes do not matter, and sj(i) and −sj(i) sound identical when played on a speaker. Thus, if the wi found by an algorithm is scaled by any non-zero real number, the corresponding recovered source si = wiᵀx will be scaled by the same factor; but this usually does not matter. (These comments also apply to ICA for the brain/MEG data that we talked about in class.)

Are these the only sources of ambiguity in ICA? It turns out that they are, so long as the sources si are non-Gaussian. To see what the difficulty is with Gaussian data, consider an example in which n = 2, and s ∼ N(0, I). Here, I is the 2x2 identity matrix. Note that the contours of the density of the standard normal distribution N(0, I) are circles centered on the origin, and the density is rotationally symmetric.

Now, suppose we observe some x = As, where A is our mixing matrix. The distribution of x will also be Gaussian, with zero mean and covariance E[xxᵀ] = E[AssᵀAᵀ] = AAᵀ. Now, let R be an arbitrary orthogonal (less formally, a rotation/reflection) matrix, so that RRᵀ = RᵀR = I, and let A′ = AR. Then if the data had been mixed according to A′ instead of A, we would have instead observed x′ = A′s. The distribution of x′ is also Gaussian, with zero mean and covariance E[x′(x′)ᵀ] = E[A′ssᵀ(A′)ᵀ] = E[ARssᵀ(AR)ᵀ] = ARRᵀAᵀ = AAᵀ.
Hence, whether the mixing matrix is A or A′, we would observe data from a N(0, AAᵀ) distribution. Thus, there is no way to tell if the sources were mixed using A or A′. So, there is an arbitrary rotational component in the mixing matrix that cannot be determined from the data, and we cannot recover the original sources.

Our argument above was based on the fact that the multivariate standard normal distribution is rotationally symmetric. Despite the bleak picture that this paints for ICA on Gaussian data, it turns out that, so long as the data is not Gaussian, it is possible, given enough data, to recover the n independent sources.

2 Densities and linear transformations

Before moving on to derive the ICA algorithm proper, we first digress briefly to talk about the effect of linear transformations on densities.

Suppose we have a random variable s drawn according to some density ps(s). For simplicity, let us say for now that s ∈ R is a real number. Now, let the random variable x be defined according to x = As (here, x ∈ R, A ∈ R). Let px be the density of x. What is px?

Let W = A⁻¹. To calculate the "probability" of a particular value of x, it is tempting to compute s = W x, then evaluate ps at that point, and conclude that "px(x) = ps(W x)." However, this is incorrect. For example, let s ∼ Uniform[0, 1], so that s's density is ps(s) = 1{0 ≤ s ≤ 1}. Now, let A = 2, so that x = 2s. Clearly, x is distributed uniformly in the interval [0, 2]. Thus, its density is given by px(x) = (0.5)1{0 ≤ x ≤ 2}. This does not equal ps(W x), where W = 0.5 = A⁻¹. Instead, the correct formula is px(x) = ps(W x)|W|.

More generally, if s is a vector-valued distribution with density ps, and x = As for a square, invertible matrix A, then the density of x is given by

px(x) = ps(W x) · |W|,

where W = A⁻¹.

Remark. If you've seen the result that A maps [0, 1]ⁿ to a set of volume |A|, then here's another way to remember the formula for px given above, that also generalizes our previous 1-dimensional example. Specifically, let A ∈ Rⁿˣⁿ be given, and let W = A⁻¹ as usual. Also let C1 = [0, 1]ⁿ be the n-dimensional hypercube, and define C2 = {As : s ∈ C1} ⊆ Rⁿ to be the image of C1 under the mapping given by A. Then it is a standard result in linear algebra (and, indeed, one of the ways of defining determinants) that the volume of C2 is given by |A|. Now, suppose s is uniformly distributed in [0, 1]ⁿ, so its density is ps(s) = 1{s ∈ C1}. Then clearly x will be uniformly distributed in C2. Its density is therefore found to be px(x) = 1{x ∈ C2}/vol(C2) (since it must integrate over C2 to 1). But using the fact that the determinant of the inverse of a matrix is just the inverse of the determinant, we have 1/vol(C2) = 1/|A| = |A⁻¹| = |W|. Thus, px(x) = 1{x ∈ C2}|W| = 1{W x ∈ C1}|W| = ps(W x)|W|.
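The 1-dimensional example above is easy to check numerically; the Octave sketch below estimates the density of x = 2s at one test point by simulation (the sample size, test point, and bin width are arbitrary choices):

    % Check p_x(x) = p_s(Wx)|W| for s ~ Uniform[0,1], x = 2s, W = 0.5.
    s = rand(100000, 1);                  % draws from Uniform[0, 1]
    x = 2 * s;                            % transformed variable
    W = 0.5;
    x0 = 1.3; h = 0.05;                   % test point and half bin width
    empirical = mean(abs(x - x0) < h) / (2 * h);   % Monte Carlo density at x0
    formula   = (0 <= W * x0 && W * x0 <= 1) * W;  % p_s(W x0) |W|
    printf("empirical %.3f vs formula %.3f\n", empirical, formula);

Both values should come out near 0.5, the uniform density on [0, 2].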
3 ICA algorithm

We are now ready to derive an ICA algorithm. The algorithm we describe is due to Bell and Sejnowski, and the interpretation we give will be of their algorithm as a method for maximum likelihood estimation. (This is different from their original interpretation, which involved a complicated idea called the infomax principle, that is no longer necessary in the derivation given the modern understanding of ICA.)

We suppose that the distribution of each source si is given by a density ps, and that the joint distribution of the sources s is given by

p(s) = Π_{i=1}^n ps(si).

Note that by modeling the joint distribution as a product of the marginals, we capture the assumption that the sources are independent. Using our formulas from the previous section, this implies the following density on x = As = W⁻¹s:

p(x) = Π_{i=1}^n ps(wiᵀx) · |W|.

All that remains is to specify a density for the individual sources ps.

Recall that, given a real-valued random variable z, its cumulative distribution function (cdf) F is defined by F(z0) = P(z ≤ z0) = ∫_{−∞}^{z0} pz(z) dz. Also, the density of z can be found from the cdf by taking its derivative: pz(z) = F′(z).

Thus, to specify a density for the si's, all we need to do is to specify some cdf for it. A cdf has to be a monotonic function that increases from zero to one. Following our previous discussion, we cannot choose the cdf to be the cdf of the Gaussian, as ICA doesn't work on Gaussian data. What we'll choose instead for the cdf, as a reasonable "default" function that slowly increases from 0 to 1, is the sigmoid function g(s) = 1/(1 + e⁻ˢ). Hence, ps(s) = g′(s).¹

¹ If you have prior knowledge that the sources' densities take a certain form, then it is a good idea to substitute that in here. But in the absence of such knowledge, the sigmoid function can be thought of as a reasonable default that seems to work well for many problems. Also, the presentation here assumes that either the data x(i) has been preprocessed to have zero mean, or that it can naturally be expected to have zero mean (such as acoustic signals). This is necessary because our assumption that ps(s) = g′(s) implies E[s] = 0 (the derivative of the logistic function is a symmetric function, and hence gives a density corresponding to a random variable with zero mean), which implies E[x] = E[As] = 0.

The square matrix W is the parameter in our model. Given a training set {x(i); i = 1, . . . , m}, the log likelihood is given by

ℓ(W) = Σ_{i=1}^m ( Σ_{j=1}^n log g′(wjᵀx(i)) + log |W| ).

We would like to maximize this in terms of W. By taking derivatives and using the fact (from the first set of notes) that ∇W|W| = |W|(W⁻¹)ᵀ, we easily derive a stochastic gradient ascent learning rule. For a training example x(i), the update rule is:

W := W + α ( [1 − 2g(w1ᵀx(i)); 1 − 2g(w2ᵀx(i)); . . . ; 1 − 2g(wnᵀx(i))] x(i)ᵀ + (Wᵀ)⁻¹ ),

where α is the learning rate.

After the algorithm converges, we then compute s(i) = W x(i) to recover the original sources.

Remark. When writing down the likelihood of the data, we implicitly assumed that the x(i)'s were independent of each other (for different values of i; note this issue is different from whether the different coordinates of x(i) are independent), so that the likelihood of the training set was given by Π_i p(x(i); W). This assumption is clearly incorrect for speech data and other time series where the x(i)'s are dependent, but it can be shown that having correlated training examples will not hurt the performance of the algorithm if we have sufficient data. But, for problems where successive training examples are correlated, when implementing stochastic gradient ascent, it also sometimes helps accelerate convergence if we visit training examples in a randomly permuted order. (I.e., run stochastic gradient ascent on a randomly shuffled copy of the training set.)
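A minimal Octave sketch of one stochastic gradient ascent pass over the data, implementing the update rule above; the learning rate, the initialization, and the use of a single pass are illustrative assumptions:

    % One pass of the Bell-Sejnowski stochastic gradient ascent update.
    % X is m x n with one training example x(i) per row.
    g = @(z) 1 ./ (1 + exp(-z));       % sigmoid used as the source cdf
    n = columns(X);
    W = eye(n);                        % arbitrary initialization
    alpha = 0.01;
    for i = randperm(rows(X))          % visit examples in randomly permuted order
      x = X(i, :)';                    % column vector x(i)
      W = W + alpha * ((1 - 2 * g(W * x)) * x' + inv(W'));
    end

In practice one would make several such passes until W stops changing, then recover the sources as s(i) = W x(i).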
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes12.txt b/Lectures/aimlcs229/cs229-notes12.txt
new file mode 100644
index 0000000..24fed58
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes12.txt
@@ -0,0 +1,338 @@
CS229 Lecture notes
Andrew Ng

Part XIII

Reinforcement Learning and Control

We now begin our study of reinforcement learning and adaptive control. In supervised learning, we saw algorithms that tried to make their outputs mimic the labels y given in the training set. In that setting, the labels gave an unambiguous "right answer" for each of the inputs x. In contrast, for many sequential decision making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the "correct" actions to take are to make it walk, and so do not know how to provide explicit supervision for a learning algorithm to try to mimic.

In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm's job to figure out how to choose actions over time so as to obtain large rewards.

Reinforcement learning has been successful in applications as diverse as autonomous helicopter flight, robot legged locomotion, cell-phone network routing, marketing strategy selection, factory control, and efficient web-page indexing. Our study of reinforcement learning will begin with a definition of Markov decision processes (MDPs), which provide the formalism in which RL problems are usually posed.

1 Markov decision processes

A Markov decision process is a tuple (S, A, {Psa}, γ, R), where:

• S is a set of states. (For example, in autonomous helicopter flight, S might be the set of all possible positions and orientations of the helicopter.)

• A is a set of actions. (For example, the set of all possible directions in which you can push the helicopter's control sticks.)

• Psa are the state transition probabilities. For each state s ∈ S and action a ∈ A, Psa is a distribution over the state space. We'll say more about this later, but briefly, Psa gives the distribution over what states we will transition to if we take action a in state s.

• γ ∈ [0, 1) is called the discount factor.

• R : S × A → R is the reward function. (Rewards are sometimes also written as a function of the state only, in which case we would have R : S → R.)

The dynamics of an MDP proceed as follows: We start in some state s0, and get to choose some action a0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s1, drawn according to s1 ∼ Ps0a0. Then, we get to pick another action a1. As a result of this action, the state transitions again, now to some s2 ∼ Ps1a1. We then pick a2, and so on. Pictorially, we can represent this process as follows:

s0 −a0→ s1 −a1→ s2 −a2→ s3 −a3→ · · ·

Upon visiting the sequence of states s0, s1, . . . with actions a0, a1, . . ., our total payoff is given by

R(s0, a0) + γR(s1, a1) + γ²R(s2, a2) + · · · .
Or, when we are writing rewards as a function of the states only, this becomes

R(s0) + γR(s1) + γ²R(s2) + · · · .

For most of our development, we will use the simpler state-rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.

Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:

E[ R(s0) + γR(s1) + γ²R(s2) + · · · ]

Note that the reward at timestep t is discounted by a factor of γᵗ. Thus, to make this expectation large, we would like to accrue positive rewards as soon as possible (and postpone negative rewards as long as possible). In economic applications where R(·) is the amount of money made, γ also has a natural interpretation in terms of the interest rate (where a dollar today is worth more than a dollar tomorrow).

A policy is any function π : S → A mapping from the states to the actions. We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s). We also define the value function for a policy π according to

V^π(s) = E[ R(s0) + γR(s1) + γ²R(s2) + · · · | s0 = s, π ].

V^π(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π.¹

¹ This notation in which we condition on π isn't technically correct because π isn't a random variable, but this is quite standard in the literature.

Given a fixed policy π, its value function V^π satisfies the Bellman equations:

V^π(s) = R(s) + γ Σ_{s′∈S} P_{sπ(s)}(s′) V^π(s′).

This says that the expected sum of discounted rewards V^π(s) for starting in s consists of two terms: First, the immediate reward R(s) that we get right away simply for starting in state s, and second, the expected sum of future discounted rewards. Examining the second term in more detail, we see that the summation term above can be rewritten E_{s′∼P_{sπ(s)}}[V^π(s′)]. This is the expected sum of discounted rewards for starting in state s′, where s′ is distributed according to P_{sπ(s)}, which is the distribution over where we will end up after taking the first action π(s) in the MDP from state s. Thus, the second term above gives the expected sum of discounted rewards obtained after the first step in the MDP.

Bellman's equations can be used to efficiently solve for V^π. Specifically, in a finite-state MDP (|S| < ∞), we can write down one such equation for V^π(s) for every state s. This gives us a set of |S| linear equations in |S| variables (the unknown V^π(s)'s, one for each state), which can be efficiently solved for the V^π(s)'s.
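Since these equations are linear, V^π can be computed with a single matrix solve. In the Octave sketch below, R is a hypothetical |S| x 1 reward vector, gamma is the given discount factor, and Ppi is the |S| x |S| matrix whose (s, s′) entry is P_{sπ(s)}(s′), assembled from the MDP's transition probabilities and the fixed policy:

    % Solve V = R + gamma * Ppi * V, i.e. (I - gamma * Ppi) V = R.
    nS = rows(R);
    V = (eye(nS) - gamma * Ppi) \ R;   % value function V^pi, one entry per state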
We also define the optimal value function according to

V∗(s) = max_π V^π(s).    (1)

In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy. There is also a version of Bellman's equations for the optimal value function:

V∗(s) = R(s) + max_{a∈A} γ Σ_{s′∈S} P_{sa}(s′) V∗(s′).    (2)

The first term above is the immediate reward as before. The second term is the maximum over all actions a of the expected future sum of discounted rewards we'll get after taking action a. You should make sure you understand this equation and see why it makes sense.

We also define a policy π∗ : S → A as follows:

π∗(s) = arg max_{a∈A} Σ_{s′∈S} P_{sa}(s′) V∗(s′).    (3)

Note that π∗(s) gives the action a that attains the maximum in the "max" in Equation (2).

It is a fact that for every state s and every policy π, we have

V∗(s) = V^{π∗}(s) ≥ V^π(s).

The first equality says that V^{π∗}, the value function for π∗, is equal to the optimal value function V∗ for every state s. Further, the inequality above says that π∗'s value is at least as large as the value of any other policy. In other words, π∗ as defined in Equation (3) is the optimal policy.

Note that π∗ has the interesting property that it is the optimal policy for all states s. Specifically, it is not the case that if we were starting in some state s then there'd be some optimal policy for that state, and if we were starting in some other state s′ then there'd be some other policy that's optimal for s′. Rather, the same policy π∗ attains the maximum in Equation (1) for all states s. This means that we can use the same policy π∗ no matter what the initial state of our MDP is.

2 Value iteration and policy iteration

We now describe two efficient algorithms for solving finite-state MDPs. For now, we will consider only MDPs with finite state and action spaces (|S| < ∞, |A| < ∞).

The first algorithm, value iteration, is as follows:

1. For each state s, initialize V(s) := 0.
2. Repeat until convergence {
       For every state, update V(s) := R(s) + max_{a∈A} γ Σ_{s′} P_{sa}(s′) V(s′).
   }

This algorithm can be thought of as repeatedly trying to update the estimated value function using Bellman Equation (2).

There are two possible ways of performing the updates in the inner loop of the algorithm. In the first, we can first compute the new values for V(s) for every state s, and then overwrite all the old values with the new values. This is called a synchronous update. In this case, the algorithm can be viewed as implementing a "Bellman backup operator" that takes a current estimate of the value function, and maps it to a new estimate. (See homework problem for details.) Alternatively, we can also perform asynchronous updates. Here, we would loop over the states (in some order), updating the values one at a time.

Under either synchronous or asynchronous updates, it can be shown that value iteration will cause V to converge to V∗. Having found V∗, we can then use Equation (3) to find the optimal policy.

Apart from value iteration, there is a second standard algorithm for finding an optimal policy for an MDP. The policy iteration algorithm proceeds as follows:

1. Initialize π randomly.
2. Repeat until convergence {
       (a) Let V := V^π.
       (b) For each state s, let π(s) := arg max_{a∈A} Σ_{s′} P_{sa}(s′) V(s′).
   }

Thus, the inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function. (The policy π found in step (b) is also called the policy that is greedy with respect to V.) Note that step (a) can be done via solving Bellman's equations as described earlier, which in the case of a fixed policy, is just a set of |S| linear equations in |S| variables.

After at most a finite number of iterations of this algorithm, V will converge to V∗, and π will converge to π∗.
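A compact Octave sketch of synchronous value iteration, with the greedy policy extracted at the end via Equation (3); here P is a hypothetical |S| x |S| x |A| array with P(s, s2, a) = Psa(s2), R is |S| x 1, gamma is given, and the stopping tolerance is an arbitrary choice:

    % Synchronous value iteration: repeatedly apply the Bellman backup (2).
    nS = rows(R); nA = size(P, 3);
    V = zeros(nS, 1);
    do
      Q = zeros(nS, nA);
      for a = 1:nA
        Q(:, a) = R + gamma * P(:, :, a) * V;  % backup for each action
      end
      Vnew = max(Q, [], 2);                    % V(s) := R(s) + max_a gamma * ...
      delta = max(abs(Vnew - V));
      V = Vnew;
    until delta < 1e-8
    [~, pi_star] = max(Q, [], 2);              % greedy policy, as in Equation (3)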
Both value iteration and policy iteration are standard algorithms for solving MDPs, and there isn't currently universal agreement over which algorithm is better. For small MDPs, policy iteration is often very fast and converges with very few iterations. However, for MDPs with large state spaces, solving for V^π explicitly would involve solving a large system of linear equations, and could be difficult. In these problems, value iteration may be preferred. For this reason, in practice value iteration seems to be used more often than policy iteration.

3 Learning a model for an MDP

So far, we have discussed MDPs and algorithms for MDPs assuming that the state transition probabilities and rewards are known. In many realistic problems, we are not given state transition probabilities and rewards explicitly, but must instead estimate them from data. (Usually, S, A and γ are known.)

For example, suppose that, for the inverted pendulum problem (see problem set 4), we had a number of trials in the MDP, that proceeded as follows:

s0(1) −a0(1)→ s1(1) −a1(1)→ s2(1) −a2(1)→ s3(1) −a3(1)→ · · ·
s0(2) −a0(2)→ s1(2) −a1(2)→ s2(2) −a2(2)→ s3(2) −a3(2)→ · · ·
· · ·

Here, si(j) is the state we were in at time i of trial j, and ai(j) is the corresponding action that was taken from that state. In practice, each of the trials above might be run until the MDP terminates (such as if the pole falls over in the inverted pendulum problem), or it might be run for some large but finite number of timesteps.

Given this "experience" in the MDP consisting of a number of trials, we can then easily derive the maximum likelihood estimates for the state transition probabilities:

Psa(s′) = (#times we took action a in state s and got to s′) / (#times we took action a in state s)    (4)

Or, if the ratio above is "0/0"—corresponding to the case of never having taken action a in state s before—then we might simply estimate Psa(s′) to be 1/|S|. (I.e., estimate Psa to be the uniform distribution over all states.)

Note that, if we gain more experience (observe more trials) in the MDP, there is an efficient way to update our estimated state transition probabilities using the new experience. Specifically, if we keep around the counts for both the numerator and denominator terms of (4), then as we observe more trials, we can simply keep accumulating those counts. Computing the ratio of these counts then gives our estimate of Psa.

Using a similar procedure, if R is unknown, we can also pick our estimate of the expected immediate reward R(s) in state s to be the average reward observed in state s.

Having learned a model for the MDP, we can then use either value iteration or policy iteration to solve the MDP using the estimated transition probabilities and rewards. For example, putting together model learning and value iteration, here is one possible algorithm for learning in an MDP with unknown state transition probabilities:

1. Initialize π randomly.
2. Repeat {
       (a) Execute π in the MDP for some number of trials.
       (b) Using the accumulated experience in the MDP, update our estimates for Psa (and R, if applicable).
       (c) Apply value iteration with the estimated state transition probabilities and rewards to get a new estimated value function V.
       (d) Update π to be the greedy policy with respect to V.
   }
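A small Octave sketch of the count-based estimate in Equation (4), maintained incrementally as in step (b); the array names, the sizes nS and nA, and the trial format (rows of observed (s, a, s2) transitions with integer indices) are illustrative assumptions:

    % Accumulate counts over observed transitions and form the estimate (4).
    num = zeros(nS, nA, nS);   % #times we took a in s and got to s2
    den = zeros(nS, nA);       % #times we took a in s
    for t = 1:rows(transitions)
      s = transitions(t, 1); a = transitions(t, 2); s2 = transitions(t, 3);
      num(s, a, s2) += 1;
      den(s, a) += 1;
    end
    Psa = zeros(nS, nA, nS);
    for s = 1:nS
      for a = 1:nA
        if den(s, a) > 0
          Psa(s, a, :) = num(s, a, :) / den(s, a);
        else
          Psa(s, a, :) = 1 / nS;   % never tried (s, a): use the uniform distribution
        end
      end
    end

Keeping num and den around makes it cheap to fold in new trials: just keep accumulating the counts and re-form the ratio.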
We note that, for this particular algorithm, there is one simple optimization that can make it run much more quickly. Specifically, in the inner loop of the algorithm where we apply value iteration, if instead of initializing value iteration with V = 0, we initialize it with the solution found during the previous iteration of our algorithm, then that will provide value iteration with a much better initial starting point and make it converge more quickly.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes2.txt b/Lectures/aimlcs229/cs229-notes2.txt
new file mode 100644
index 0000000..e3c3914
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes2.txt
@@ -0,0 +1,1293 @@
CS229 Lecture notes
Andrew Ng

Part IV

Generative Learning algorithms

So far, we've mainly been talking about learning algorithms that model p(y|x; θ), the conditional distribution of y given x. For instance, logistic regression modeled p(y|x; θ) as hθ(x) = g(θᵀx) where g is the sigmoid function. In these notes, we'll talk about a different type of learning algorithm.

Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line—that is, a decision boundary—that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.

Here's a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.

Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms. Here, we'll talk about algorithms that instead try to model p(x|y) (and p(y)). These algorithms are called generative learning algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.

After modeling p(y) (called the class priors) and p(x|y), our algorithm can then use Bayes rule to derive the posterior distribution on y given x:

p(y|x) = p(x|y)p(y) / p(x).

Here, the denominator is given by p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0) (you should be able to verify that this is true from the standard properties of probabilities), and thus can also be expressed in terms of the quantities p(x|y) and p(y) that we've learned. Actually, if we were calculating p(y|x) in order to make a prediction, then we don't actually need to calculate the denominator, since

arg max_y p(y|x) = arg max_y p(x|y)p(y)/p(x) = arg max_y p(x|y)p(y).

1 Gaussian discriminant analysis

The first generative learning algorithm that we'll look at is Gaussian discriminant analysis (GDA). In this model, we'll assume that p(x|y) is distributed according to a multivariate normal distribution.
Let's talk briefly about the properties of multivariate normal distributions before moving on to the GDA model itself.

1.1 The multivariate normal distribution

The multivariate normal distribution in n dimensions, also called the multivariate Gaussian distribution, is parameterized by a mean vector µ ∈ Rⁿ and a covariance matrix Σ ∈ Rⁿˣⁿ, where Σ ≥ 0 is symmetric and positive semi-definite. Also written "N(µ, Σ)", its density is given by:

p(x; µ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp( −(1/2)(x − µ)ᵀΣ⁻¹(x − µ) ).

In the equation above, "|Σ|" denotes the determinant of the matrix Σ.

For a random variable X distributed N(µ, Σ), the mean is (unsurprisingly) given by µ:

E[X] = ∫ₓ x p(x; µ, Σ) dx = µ

The covariance of a vector-valued random variable Z is defined as Cov(Z) = E[(Z − E[Z])(Z − E[Z])ᵀ]. This generalizes the notion of the variance of a real-valued random variable. The covariance can also be defined as Cov(Z) = E[ZZᵀ] − (E[Z])(E[Z])ᵀ. (You should be able to prove to yourself that these two definitions are equivalent.) If X ∼ N(µ, Σ), then

Cov(X) = Σ.
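A direct Octave transcription of this density; the function name, and evaluating at a single point x (n x 1), are illustrative choices:

    % Density of the multivariate Gaussian N(mu, Sigma) at a point x.
    function p = gaussian_density(x, mu, Sigma)
      n = rows(x);
      d = x - mu;
      p = exp(-0.5 * d' * (Sigma \ d)) / ((2*pi)^(n/2) * sqrt(det(Sigma)));
    end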
Here are some examples of what the density of a Gaussian distribution looks like:

[Figure: three surface plots of 2D Gaussian densities with mean zero and covariances Σ = I, Σ = 0.6I, and Σ = 2I.]

The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6I; and the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian becomes more "spread-out," and as it becomes smaller, the distribution becomes more "compressed."

Let's look at some more examples.

[Figure: three surface plots of 2D Gaussian densities with mean zero and increasing off-diagonal covariance entries.]

The figures above show Gaussians with mean 0, and with covariance matrices respectively

Σ = [1 0; 0 1];  Σ = [1 0.5; 0.5 1];  Σ = [1 0.8; 0.8 1].

The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more "compressed" towards the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:

[Figure: contour plots of the same three densities.]

Here's one last set of examples generated by varying Σ:

[Figure: contour plots of three more densities.]

The plots above used, respectively,

Σ = [1 -0.5; -0.5 1];  Σ = [1 -0.8; -0.8 1];  Σ = [3 0.8; 0.8 1].

From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes "compressed" again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will form ellipses (the rightmost figure showing an example).

As our last set of examples, fixing Σ = I, by varying µ, we can also move the mean of the density around.

[Figure: three surface plots of identity-covariance Gaussians with different means.]

The figures above were generated using Σ = I, and respectively

µ = [1; 0];  µ = [-0.5; 0];  µ = [-1; -1.5].

1.2 The Gaussian Discriminant Analysis model

When we have a classification problem in which the input features x are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution. The model is:

y ∼ Bernoulli(φ)
x|y = 0 ∼ N(µ0, Σ)
x|y = 1 ∼ N(µ1, Σ)

Writing out the distributions, this is:

p(y) = φ^y (1 − φ)^{1−y}
p(x|y = 0) = (1/((2π)^{n/2}|Σ|^{1/2})) exp( −(1/2)(x − µ0)ᵀΣ⁻¹(x − µ0) )
p(x|y = 1) = (1/((2π)^{n/2}|Σ|^{1/2})) exp( −(1/2)(x − µ1)ᵀΣ⁻¹(x − µ1) )

Here, the parameters of our model are φ, Σ, µ0 and µ1. (Note that while there are two different mean vectors µ0 and µ1, this model is usually applied using only one covariance matrix Σ.) The log-likelihood of the data is given by

ℓ(φ, µ0, µ1, Σ) = log Π_{i=1}^m p(x(i), y(i); φ, µ0, µ1, Σ)
                = log Π_{i=1}^m p(x(i)|y(i); µ0, µ1, Σ) p(y(i); φ).

By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimate of the parameters (see problem set 1) to be:

φ = (1/m) Σ_{i=1}^m 1{y(i) = 1}
µ0 = ( Σ_{i=1}^m 1{y(i) = 0} x(i) ) / ( Σ_{i=1}^m 1{y(i) = 0} )
µ1 = ( Σ_{i=1}^m 1{y(i) = 1} x(i) ) / ( Σ_{i=1}^m 1{y(i) = 1} )
Σ = (1/m) Σ_{i=1}^m (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})ᵀ.

Pictorially, what the algorithm is doing can be seen as follows:

[Figure: training set with the contours of the two fitted Gaussians and the straight-line decision boundary.]

Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix Σ, but they have different means µ0 and µ1. Also shown in the figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary, we'll predict y = 1 to be the most likely outcome, and on the other side, we'll predict y = 0.
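These four estimates are one-liners in Octave; X (m x n, one example per row) and y (m x 1 vector of 0/1 labels) are hypothetical inputs:

    % Maximum likelihood estimates for the GDA parameters.
    m = rows(X);
    phi = mean(y == 1);
    mu0 = sum(X(y == 0, :), 1)' / sum(y == 0);  % mean of the negative examples
    mu1 = sum(X(y == 1, :), 1)' / sum(y == 1);  % mean of the positive examples
    Mu = (y == 0) * mu0' + (y == 1) * mu1';     % m x n matrix, row i holds mu_{y(i)}
    Sigma = (X - Mu)' * (X - Mu) / m;           % shared covariance matrix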
1.3 Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y = 1|x; φ, µ0, µ1, Σ) as a function of x, we'll find that it can be expressed in the form

p(y = 1|x; φ, Σ, µ0, µ1) = 1 / (1 + exp(-θ^T x)),

where θ is some appropriate function of φ, Σ, µ0, µ1.^1 This is exactly the form that logistic regression—a discriminative algorithm—used to model p(y = 1|x).

1 This uses the convention of redefining the x^(i)'s on the right-hand-side to be (n+1)-dimensional vectors by adding the extra coordinate x0^(i) = 1; see problem set 1.

When would we prefer one model over another? GDA and logistic regression will, in general, give different decision boundaries when trained on the same dataset. Which is better?
We just argued that if p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) necessarily follows a logistic function. The converse, however, is not true; i.e., p(y|x) being a logistic function does not imply p(x|y) is multivariate Gaussian. This shows that GDA makes stronger modeling assumptions about the data than does logistic regression. It turns out that when these modeling assumptions are correct, then GDA will find better fits to the data, and is a better model. Specifically, when p(x|y) is indeed Gaussian (with shared Σ), then GDA is asymptotically efficient. Informally, this means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than GDA (in terms of, say, how accurately they estimate p(y|x)). In particular, it can be shown that in this setting, GDA will be a better algorithm than logistic regression; and more generally, even for small training set sizes, we would generally expect GDA to do better.
In contrast, by making significantly weaker assumptions, logistic regression is also more robust and less sensitive to incorrect modeling assumptions. There are many different sets of assumptions that would lead to p(y|x) taking the form of a logistic function. For example, if x|y = 0 ∼ Poisson(λ0), and x|y = 1 ∼ Poisson(λ1), then p(y|x) will be logistic. Logistic regression will also work well on Poisson data like this. But if we were to use GDA on such data—and fit Gaussian distributions to such non-Gaussian data—then the results will be less predictable, and GDA may (or may not) do well.
To summarize: GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn "well") when the modeling assumptions are correct or at least approximately correct. Logistic regression makes weaker assumptions, and is significantly more robust to deviations from modeling assumptions. Specifically, when the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA. For this reason, in practice logistic regression is used more often than GDA. (Some related considerations about discriminative vs. generative models also apply for the Naive Bayes algorithm that we discuss next, but the Naive Bayes algorithm is still considered a very good, and is certainly also a very popular, classification algorithm.)

2 Naive Bayes

In GDA, the feature vectors x were continuous, real-valued vectors. Let's now talk about a different learning algorithm in which the xi's are discrete-valued.
For our motivating example, consider building an email spam filter using machine learning. Here, we wish to classify messages according to whether they are unsolicited commercial (spam) email, or non-spam email. After learning to do this, we can then have our mail reader automatically filter out the spam messages and perhaps place them in a separate mail folder. Classifying emails is one example of a broader set of problems called text classification.
Let's say we have a training set (a set of emails labeled as spam or non-spam). We'll begin our construction of our spam filter by specifying the features xi used to represent an email.
We will represent an email via a feature vector whose length is equal to the number of words in the dictionary.
Specifically, if an email contains the i-th word of the dictionary, then we will set xi = 1; otherwise, we let xi = 0. For instance, the vector

x =  [ 1 ]   a
     [ 0 ]   aardvark
     [ 0 ]   aardwolf
     [ . ]
     [ . ]
     [ 1 ]   buy
     [ . ]
     [ . ]
     [ 0 ]   zygmurgy

is used to represent an email that contains the words "a" and "buy," but not "aardvark," "aardwolf" or "zygmurgy."^2 The set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.

2 Actually, rather than looking through an English dictionary for the list of all English words, in practice it is more common to look through our training set and encode in our feature vector only the words that occur at least once there. Apart from reducing the number of words modeled and hence reducing our computational and space requirements, this also has the advantage of allowing us to model/include as a feature many words that may appear in your email (such as "cs229") but that you won't find in a dictionary. Sometimes (as in the homework), we also exclude the very high frequency words (which will be words like "the," "of," "and"; these high frequency, "content free" words are called stop words) since they occur in so many documents and do little to indicate whether an email is spam or non-spam.
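As a small Octave sketch of this representation (our own illustration; the vocabulary and email below are hypothetical):

vocab = {'a'; 'aardvark'; 'aardwolf'; 'buy'; 'zygmurgy'};
emailWords = {'a', 'buy', 'now'};              % words appearing in the email
x = zeros(length(vocab), 1);
for i = 1:length(vocab)
  x(i) = any(strcmp(vocab{i}, emailWords));    % x(i) = 1 iff word i appears
end
% x is now [1; 0; 0; 1; 0]: "a" and "buy" appear, the other words do not.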
Having chosen our feature vector, we now want to build a generative model. So, we have to model p(x|y). But if we have, say, a vocabulary of 50000 words, then x ∈ {0, 1}^50000 (x is a 50000-dimensional vector of 0's and 1's), and if we were to model x explicitly with a multinomial distribution over the 2^50000 possible outcomes, then we'd end up with a (2^50000 - 1)-dimensional parameter vector. This is clearly too many parameters.
To model p(x|y), we will therefore make a very strong assumption. We will assume that the xi's are conditionally independent given y. This assumption is called the Naive Bayes (NB) assumption, and the resulting algorithm is called the Naive Bayes classifier. For instance, if y = 1 means spam email, "buy" is word 2087 and "price" is word 39831, then we are assuming that if I tell you y = 1 (that a particular piece of email is spam), then knowledge of x2087 (knowledge of whether "buy" appears in the message) will have no effect on your beliefs about the value of x39831 (whether "price" appears). More formally, this can be written p(x2087|y) = p(x2087|y, x39831). (Note that this is not the same as saying that x2087 and x39831 are independent, which would have been written "p(x2087) = p(x2087|x39831)"; rather, we are only assuming that x2087 and x39831 are conditionally independent given y.)
We now have:

p(x1, . . . , x50000|y)
  = p(x1|y) p(x2|y, x1) p(x3|y, x1, x2) · · · p(x50000|y, x1, . . . , x49999)
  = p(x1|y) p(x2|y) p(x3|y) · · · p(x50000|y)
  = prod_{i=1}^n p(xi|y)

The first equality simply follows from the usual properties of probabilities (the chain rule), and the second equality used the NB assumption. We note that even though the Naive Bayes assumption is an extremely strong assumption, the resulting algorithm works well on many problems.
Our model is parameterized by φ_{i|y=1} = p(xi = 1|y = 1), φ_{i|y=0} = p(xi = 1|y = 0), and φy = p(y = 1). As usual, given a training set {(x^(i), y^(i)); i = 1, . . . , m}, we can write down the joint likelihood of the data:

L(φy, φ_{i|y=0}, φ_{i|y=1}) = prod_{i=1}^m p(x^(i), y^(i)).

Maximizing this with respect to φy, φ_{i|y=0} and φ_{i|y=1} gives the maximum likelihood estimates:

φ_{j|y=1} = sum_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 1} / sum_{i=1}^m 1{y^(i) = 1}
φ_{j|y=0} = sum_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} / sum_{i=1}^m 1{y^(i) = 0}
φy = sum_{i=1}^m 1{y^(i) = 1} / m

In the equations above, the "∧" symbol means "and." The parameters have a very natural interpretation. For instance, φ_{j|y=1} is just the fraction of the spam (y = 1) emails in which word j does appear.
Having fit all these parameters, to make a prediction on a new example with features x, we then simply calculate

p(y = 1|x) = p(x|y = 1) p(y = 1) / p(x)
           = (prod_{i=1}^n p(xi|y = 1)) p(y = 1) / [ (prod_{i=1}^n p(xi|y = 1)) p(y = 1) + (prod_{i=1}^n p(xi|y = 0)) p(y = 0) ],

and pick whichever class has the higher posterior probability.
Lastly, we note that while we have developed the Naive Bayes algorithm mainly for the case of problems where the features xi are binary-valued, the generalization to where xi can take values in {1, 2, . . . , ki} is straightforward. Here, we would simply model p(xi|y) as multinomial rather than as Bernoulli. Indeed, even if some original input attribute (say, the living area of a house, as in our earlier example) were continuous valued, it is quite common to discretize it—that is, turn it into a small set of discrete values—and apply Naive Bayes. For instance, if we use some feature xi to represent living area, we might discretize the continuous values as follows:

Living area (sq. feet):  < 400   400-800   800-1200   1200-1600   > 1600
xi:                        1        2          3           4         5

Thus, for a house with living area 890 square feet, we would set the value of the corresponding feature xi to 3. We can then apply the Naive Bayes algorithm, and model p(xi|y) with a multinomial distribution, as described previously. When the original, continuous-valued attributes are not well-modeled by a multivariate normal distribution, discretizing the features and using Naive Bayes (instead of GDA) will often result in a better classifier.
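As an illustration (ours, not code from the notes), here is a minimal Octave sketch of these estimates and of the posterior calculation. Computing in log space is a standard numerical-stability choice we have added, not something the notes prescribe; X is m x n binary and y is m x 1 in {0, 1}:

function [phi1, phi0, phiy] = nbFit(X, y)
  phi1 = X' * (y == 1) / sum(y == 1);   % n x 1: estimates of p(x_j = 1 | y = 1)
  phi0 = X' * (y == 0) / sum(y == 0);   % n x 1: estimates of p(x_j = 1 | y = 0)
  phiy = mean(y == 1);                  % estimate of p(y = 1)
end

function p = nbPredict(x, phi1, phi0, phiy)
  % Posterior p(y = 1 | x) for a binary feature vector x (n x 1).
  % Note: if some phi is exactly 0, log(0) = -Inf here; this is precisely
  % the problem that Laplace smoothing, discussed next, fixes.
  logp1 = sum(log(phi1 .^ x .* (1 - phi1) .^ (1 - x))) + log(phiy);
  logp0 = sum(log(phi0 .^ x .* (1 - phi0) .^ (1 - x))) + log(1 - phiy);
  p = 1 / (1 + exp(logp0 - logp1));
end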
2.1 Laplace smoothing

The Naive Bayes algorithm as we have described it will work fairly well for many problems, but there is a simple change that makes it work much better, especially for text classification. Let's briefly discuss a problem with the algorithm in its current form, and then talk about how we can fix it.
Consider spam/email classification, and let's suppose that, after completing CS229 and having done excellent work on the project, you decide around June 2003 to submit the work you did to the NIPS conference for publication. (NIPS is one of the top machine learning conferences, and the deadline for submitting a paper is typically in late June or early July.) Because you end up discussing the conference in your emails, you also start getting messages with the word "nips" in them. But this is your first NIPS paper, and until this time, you had not previously seen any emails containing the word "nips"; in particular, "nips" did not ever appear in your training set of spam/non-spam emails. Assuming that "nips" was the 35000th word in the dictionary, your Naive Bayes spam filter therefore had picked its maximum likelihood estimates of the parameters φ_{35000|y} to be

φ_{35000|y=1} = sum_{i=1}^m 1{x_35000^(i) = 1 ∧ y^(i) = 1} / sum_{i=1}^m 1{y^(i) = 1} = 0
φ_{35000|y=0} = sum_{i=1}^m 1{x_35000^(i) = 1 ∧ y^(i) = 0} / sum_{i=1}^m 1{y^(i) = 0} = 0

I.e., because it has never seen "nips" before in either spam or non-spam training examples, it thinks the probability of seeing it in either type of email is zero. Hence, when trying to decide if one of these messages containing "nips" is spam, it calculates the class posterior probabilities, and obtains

p(y = 1|x) = (prod_{i=1}^n p(xi|y = 1)) p(y = 1) / [ (prod_{i=1}^n p(xi|y = 1)) p(y = 1) + (prod_{i=1}^n p(xi|y = 0)) p(y = 0) ]
           = 0/0.

This is because each of the terms "prod_{i=1}^n p(xi|y)" includes a term p(x_35000|y) = 0 that is multiplied into it. Hence, our algorithm obtains 0/0, and doesn't know how to make a prediction.
Stating the problem more broadly, it is statistically a bad idea to estimate the probability of some event to be zero just because you haven't seen it before in your finite training set. Take the problem of estimating the mean of a multinomial random variable z taking values in {1, . . . , k}. We can parameterize our multinomial with φi = p(z = i). Given a set of m independent observations {z^(1), . . . , z^(m)}, the maximum likelihood estimates are given by

φj = sum_{i=1}^m 1{z^(i) = j} / m.

As we saw previously, if we were to use these maximum likelihood estimates, then some of the φj's might end up as zero, which was a problem. To avoid this, we can use Laplace smoothing, which replaces the above estimate with

φj = ( sum_{i=1}^m 1{z^(i) = j} + 1 ) / (m + k).

Here, we've added 1 to the numerator, and k to the denominator. Note that sum_{j=1}^k φj = 1 still holds (check this yourself!), which is a desirable property since the φj's are estimates for probabilities that we know must sum to 1. Also, φj ≠ 0 for all values of j, solving our problem of probabilities being estimated as zero. Under certain (arguably quite strong) conditions, it can be shown that Laplace smoothing actually gives the optimal estimator of the φj's.
Returning to our Naive Bayes classifier, with Laplace smoothing, we therefore obtain the following estimates of the parameters:

φ_{j|y=1} = ( sum_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 1} + 1 ) / ( sum_{i=1}^m 1{y^(i) = 1} + 2 )
φ_{j|y=0} = ( sum_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} + 1 ) / ( sum_{i=1}^m 1{y^(i) = 0} + 2 )

(In practice, it usually doesn't matter much whether we apply Laplace smoothing to φy or not, since we will typically have a fair fraction each of spam and non-spam messages, so φy will be a reasonable estimate of p(y = 1) and will be quite far from 0 anyway.)
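Continuing the hypothetical Octave sketch from the previous section, Laplace smoothing is a one-line change to each estimate:

phi1 = (X' * (y == 1) + 1) / (sum(y == 1) + 2);   % smoothed p(x_j = 1 | y = 1)
phi0 = (X' * (y == 0) + 1) / (sum(y == 0) + 2);   % smoothed p(x_j = 1 | y = 0)
% Every phi is now strictly between 0 and 1, so the 0/0 posterior
% described above can no longer occur.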
2.2 Event models for text classification

To close off our discussion of generative learning algorithms, let's talk about one more model that is specifically for text classification. While Naive Bayes as we've presented it will work well for many classification problems, for text classification, there is a related model that does even better.
In the specific context of text classification, Naive Bayes as presented uses what's called the multi-variate Bernoulli event model. In this model, we assumed that the way an email is generated is that first it is randomly determined (according to the class priors p(y)) whether a spammer or non-spammer will send you your next message. Then, the person sending the email runs through the dictionary, deciding whether to include each word i in that email independently and according to the probabilities p(xi = 1|y) = φ_{i|y}. Thus, the probability of a message is given by p(y) prod_{i=1}^n p(xi|y).
Here's a different model, called the multinomial event model. To describe this model, we will use a different notation and set of features for representing emails. We let xi denote the identity of the i-th word in the email. Thus, xi is now an integer taking values in {1, . . . , |V|}, where |V| is the size of our vocabulary (dictionary). An email of n words is now represented by a vector (x1, x2, . . . , xn) of length n; note that n can vary for different documents. For instance, if an email starts with "A NIPS . . . ," then x1 = 1 ("a" is the first word in the dictionary), and x2 = 35000 (if "nips" is the 35000th word in the dictionary).
In the multinomial event model, we assume that the way an email is generated is via a random process in which spam/non-spam is first determined (according to p(y)) as before. Then, the sender of the email writes the email by first generating x1 from some multinomial distribution over words (p(x1|y)). Next, the second word x2 is chosen independently of x1 but from the same multinomial distribution, and similarly for x3, x4, and so on, until all n words of the email have been generated. Thus, the overall probability of a message is given by p(y) prod_{i=1}^n p(xi|y). Note that this formula looks like the one we had earlier for the probability of a message under the multi-variate Bernoulli event model, but the terms in the formula now mean very different things. In particular, xi|y is now a multinomial, rather than a Bernoulli, distribution.
The parameters for our new model are φy = p(y) as before, φ_{i|y=1} = p(xj = i|y = 1) (for any j) and φ_{i|y=0} = p(xj = i|y = 0). Note that we have assumed that p(xj|y) is the same for all values of j (i.e., that the distribution according to which a word is generated does not depend on its position j within the email).
If we are given a training set {(x^(i), y^(i)); i = 1, . . . , m} where x^(i) = (x1^(i), x2^(i), . . . , x_{ni}^(i)) (here, ni is the number of words in the i-th training example), the likelihood of the data is given by

L(φy, φ_{i|y=0}, φ_{i|y=1}) = prod_{i=1}^m p(x^(i), y^(i))
  = prod_{i=1}^m ( prod_{j=1}^{ni} p(x_j^(i)|y^(i); φ_{i|y=0}, φ_{i|y=1}) ) p(y^(i); φy).

Maximizing this yields the maximum likelihood estimates of the parameters:

φ_{k|y=1} = sum_{i=1}^m sum_{j=1}^{ni} 1{x_j^(i) = k ∧ y^(i) = 1} / sum_{i=1}^m 1{y^(i) = 1} ni
φ_{k|y=0} = sum_{i=1}^m sum_{j=1}^{ni} 1{x_j^(i) = k ∧ y^(i) = 0} / sum_{i=1}^m 1{y^(i) = 0} ni
φy = sum_{i=1}^m 1{y^(i) = 1} / m.

If we were to apply Laplace smoothing (which is needed in practice for good performance) when estimating φ_{k|y=0} and φ_{k|y=1}, we add 1 to the numerators and |V| to the denominators, and obtain:

φ_{k|y=1} = ( sum_{i=1}^m sum_{j=1}^{ni} 1{x_j^(i) = k ∧ y^(i) = 1} + 1 ) / ( sum_{i=1}^m 1{y^(i) = 1} ni + |V| )
φ_{k|y=0} = ( sum_{i=1}^m sum_{j=1}^{ni} 1{x_j^(i) = k ∧ y^(i) = 0} + 1 ) / ( sum_{i=1}^m 1{y^(i) = 0} ni + |V| ).

While not necessarily the very best classification algorithm, the Naive Bayes classifier often works surprisingly well. It is often also a very good "first thing to try," given its simplicity and ease of implementation.
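To make the estimates above concrete, here is a minimal Octave sketch (our own illustration, not code from the notes) of the Laplace-smoothed multinomial event model. Representing each email as a vector of word indices in a cell array is an assumption of this sketch:

function [phi1, phi0, phiy] = multinomialNbFit(emails, y, V)
  % emails{i} is a vector of word indices in {1, ..., V}; y is m x 1 in {0,1}.
  m = length(emails);
  counts1 = ones(V, 1);  counts0 = ones(V, 1);   % the "+1" numerators
  len1 = V;  len0 = V;                           % the "+|V|" denominators
  for i = 1:m
    c = accumarray(emails{i}(:), 1, [V, 1]);     % word counts in email i
    if y(i) == 1
      counts1 = counts1 + c;  len1 = len1 + length(emails{i});
    else
      counts0 = counts0 + c;  len0 = len0 + length(emails{i});
    end
  end
  phi1 = counts1 / len1;    % phi_{k|y=1}
  phi0 = counts0 / len0;    % phi_{k|y=0}
  phiy = mean(y == 1);
end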
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes3.txt b/Lectures/aimlcs229/cs229-notes3.txt
new file mode 100644
index 0000000..b5e7706
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes3.txt
@@ -0,0 +1,1379 @@
CS229 Lecture notes
Andrew Ng

Part V
Support Vector Machines

This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) "off-the-shelf" supervised learning algorithms. To tell the SVM story, we'll need to first talk about margins and the idea of separating data with a large "gap." Next, we'll talk about the optimal margin classifier, which will lead us into a digression on Lagrange duality. We'll also see kernels, which give a way to apply SVMs efficiently in very high dimensional (such as infinite-dimensional) feature spaces, and finally, we'll close off the story with the SMO algorithm, which gives an efficient implementation of SVMs.

1 Margins: Intuition

We'll start our story on SVMs by talking about margins. This section will give the intuitions about margins and about the "confidence" of our predictions; these ideas will be made formal in Section 3.
Consider logistic regression, where the probability p(y = 1|x; θ) is modeled by hθ(x) = g(θ^T x). We would then predict "1" on an input x if and only if hθ(x) ≥ 0.5, or equivalently, if and only if θ^T x ≥ 0. Consider a positive training example (y = 1). The larger θ^T x is, the larger also is hθ(x) = p(y = 1|x; θ), and thus also the higher our degree of "confidence" that the label is 1. Thus, informally we can think of our prediction as being a very confident one that y = 1 if θ^T x ≫ 0. Similarly, we think of logistic regression as making a very confident prediction of y = 0 if θ^T x ≪ 0. Given a training set, again informally it seems that we'd have found a good fit to the training data if we can find θ so that θ^T x^(i) ≫ 0 whenever y^(i) = 1, and θ^T x^(i) ≪ 0 whenever y^(i) = 0, since this would reflect a very confident (and correct) set of classifications for all the training examples. This seems to be a nice goal to aim for, and we'll soon formalize this idea using the notion of functional margins.
For a different type of intuition, consider the following figure, in which x's represent positive training examples, o's denote negative training examples, a decision boundary (this is the line given by the equation θ^T x = 0, and is also called the separating hyperplane) is also shown, and three points have also been labeled A, B and C.

[Figure: a linearly separable training set with the separating hyperplane θ^T x = 0 and three points A, B and C at decreasing distances from the boundary.]

Notice that the point A is very far from the decision boundary. If we are asked to make a prediction for the value of y at A, it seems we should be quite confident that y = 1 there. Conversely, the point C is very close to the decision boundary, and while it's on the side of the decision boundary on which we would predict y = 1, it seems likely that just a small change to the decision boundary could easily have caused our prediction to be y = 0. Hence, we're much more confident about our prediction at A than at C. The point B lies in-between these two cases, and more broadly, we see that if a point is far from the separating hyperplane, then we may be significantly more confident in our predictions.
Again, informally we think it'd be nice if, given a training set, we manage to find a decision boundary that allows us to make all correct and confident (meaning far from the decision boundary) predictions on the training examples. We'll formalize this later using the notion of geometric margins.

2 Notation

To make our discussion of SVMs easier, we'll first need to introduce a new notation for talking about classification. We will be considering a linear classifier for a binary classification problem with labels y and features x. From now on, we'll use y ∈ {−1, 1} (instead of {0, 1}) to denote the class labels. Also, rather than parameterizing our linear classifier with the vector θ, we will use parameters w, b, and write our classifier as

hw,b(x) = g(w^T x + b).

Here, g(z) = 1 if z ≥ 0, and g(z) = −1 otherwise. This "w, b" notation allows us to explicitly treat the intercept term b separately from the other parameters. (We also drop the convention we had previously of letting x0 = 1 be an extra coordinate in the input feature vector.) Thus, b takes the role of what was previously θ0, and w takes the role of [θ1 . . . θn]^T.
Note also that, from our definition of g above, our classifier will directly predict either 1 or −1 (cf. the perceptron algorithm), without first going through the intermediate step of estimating the probability of y being 1 (which was what logistic regression did).

3 Functional and geometric margins

Let's formalize the notions of the functional and geometric margins. Given a training example (x^(i), y^(i)), we define the functional margin of (w, b) with respect to the training example as

γ̂^(i) = y^(i) (w^T x^(i) + b).

Note that if y^(i) = 1, then for the functional margin to be large (i.e., for our prediction to be confident and correct), we need w^T x^(i) + b to be a large positive number. Conversely, if y^(i) = −1, then for the functional margin to be large, we need w^T x^(i) + b to be a large negative number. Moreover, if y^(i) (w^T x^(i) + b) > 0, then our prediction on this example is correct. (Check this yourself.) Hence, a large functional margin represents a confident and correct prediction.
For a linear classifier with the choice of g given above (taking values in {−1, 1}), there's one property of the functional margin that makes it not a very good measure of confidence, however. Given our choice of g, we note that if we replace w with 2w and b with 2b, then since g(w^T x + b) = g(2w^T x + 2b), this would not change hw,b(x) at all. I.e., g, and hence also hw,b(x), depends only on the sign, but not on the magnitude, of w^T x + b. However, replacing (w, b) with (2w, 2b) also results in multiplying our functional margin by a factor of 2. Thus, it seems that by exploiting our freedom to scale w and b, we can make the functional margin arbitrarily large without really changing anything meaningful. Intuitively, it might therefore make sense to impose some sort of normalization condition such as that ||w||_2 = 1; i.e., we might replace (w, b) with (w/||w||_2, b/||w||_2), and instead consider the functional margin of (w/||w||_2, b/||w||_2). We'll come back to this later.
Given a training set S = {(x^(i), y^(i)); i = 1, . . . , m}, we also define the functional margin of (w, b) with respect to S as the smallest of the functional margins of the individual training examples. Denoted by γ̂, this can therefore be written:

γ̂ = min_{i=1,...,m} γ̂^(i).

Next, let's talk about geometric margins.
Consider the picture below:

[Figure: the decision boundary for (w, b) with the orthogonal vector w; a positive training example at point A is projected onto the boundary at point B, and the distance AB is the geometric margin γ^(i).]

The decision boundary corresponding to (w, b) is shown, along with the vector w. Note that w is orthogonal (at 90°) to the separating hyperplane. (You should convince yourself that this must be the case.) Consider the point at A, which represents the input x^(i) of some training example with label y^(i) = 1. Its distance to the decision boundary, γ^(i), is given by the line segment AB.
How can we find the value of γ^(i)? Well, w/||w|| is a unit-length vector pointing in the same direction as w. Since A represents x^(i), we therefore find that the point B is given by x^(i) − γ^(i) · w/||w||. But this point lies on the decision boundary, and all points x on the decision boundary satisfy the equation w^T x + b = 0. Hence,

w^T (x^(i) − γ^(i) w/||w||) + b = 0.

Solving for γ^(i) yields

γ^(i) = (w^T x^(i) + b) / ||w|| = (w/||w||)^T x^(i) + b/||w||.

This was worked out for the case of a positive training example at A in the figure, where being on the "positive" side of the decision boundary is good. More generally, we define the geometric margin of (w, b) with respect to a training example (x^(i), y^(i)) to be

γ^(i) = y^(i) ( (w/||w||)^T x^(i) + b/||w|| ).

Note that if ||w|| = 1, then the functional margin equals the geometric margin—this thus gives us a way of relating these two different notions of margin. Also, the geometric margin is invariant to rescaling of the parameters; i.e., if we replace w with 2w and b with 2b, then the geometric margin does not change. This will in fact come in handy later. Specifically, because of this invariance to the scaling of the parameters, when trying to fit w and b to training data, we can impose an arbitrary scaling constraint on w without changing anything important; for instance, we can demand that ||w|| = 1, or |w1| = 5, or |w1 + b| + |w2| = 2, and any of these can be satisfied simply by rescaling w and b.
Finally, given a training set S = {(x^(i), y^(i)); i = 1, . . . , m}, we also define the geometric margin of (w, b) with respect to S to be the smallest of the geometric margins on the individual training examples:

γ = min_{i=1,...,m} γ^(i).
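As a quick illustration (ours, not from the notes), both margin definitions are easy to compute in Octave for a candidate classifier; here X is m x n, y is m x 1 in {-1, +1}, and (w, b) is any candidate linear classifier:

functionalMargins = y .* (X * w + b);             % gamma_hat^(i), one per example
geometricMargins  = functionalMargins / norm(w);  % gamma^(i)
gammaHat = min(functionalMargins);    % functional margin w.r.t. the set
gamma    = min(geometricMargins);     % geometric margin w.r.t. the set
% Rescaling (w, b) to (2w, 2b) doubles gammaHat but leaves gamma unchanged.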
4 The optimal margin classifier

Given a training set, it seems from our previous discussion that a natural desideratum is to try to find a decision boundary that maximizes the (geometric) margin, since this would reflect a very confident set of predictions on the training set and a good "fit" to the training data. Specifically, this will result in a classifier that separates the positive and the negative training examples with a "gap" (geometric margin).
For now, we will assume that we are given a training set that is linearly separable; i.e., that it is possible to separate the positive and negative examples using some separating hyperplane. How do we find the one that achieves the maximum geometric margin? We can pose the following optimization problem:

max_{γ,w,b}  γ
s.t.  y^(i) (w^T x^(i) + b) ≥ γ,  i = 1, . . . , m
      ||w|| = 1.

I.e., we want to maximize γ, subject to each training example having functional margin at least γ. The ||w|| = 1 constraint moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least γ. Thus, solving this problem will result in (w, b) with the largest possible geometric margin with respect to the training set.
If we could solve the optimization problem above, we'd be done. But the "||w|| = 1" constraint is a nasty (non-convex) one, and this problem certainly isn't in any format that we can plug into standard optimization software to solve. So, let's try transforming the problem into a nicer one. Consider:

max_{γ̂,w,b}  γ̂ / ||w||
s.t.  y^(i) (w^T x^(i) + b) ≥ γ̂,  i = 1, . . . , m

Here, we're going to maximize γ̂/||w||, subject to the functional margins all being at least γ̂. Since the geometric and functional margins are related by γ = γ̂/||w||, this will give us the answer we want. Moreover, we've gotten rid of the constraint ||w|| = 1 that we didn't like. The downside is that we now have a nasty (again, non-convex) objective γ̂/||w||; and, we still don't have any off-the-shelf software that can solve this form of an optimization problem.
Let's keep going. Recall our earlier discussion that we can add an arbitrary scaling constraint on w and b without changing anything. This is the key idea we'll use now. We will introduce the scaling constraint that the functional margin of w, b with respect to the training set must be 1:

γ̂ = 1.

Since multiplying w and b by some constant results in the functional margin being multiplied by that same constant, this is indeed a scaling constraint, and can be satisfied by rescaling w, b. Plugging this into our problem above, and noting that maximizing γ̂/||w|| = 1/||w|| is the same thing as minimizing ||w||^2, we now have the following optimization problem:

min_{w,b}  (1/2) ||w||^2
s.t.  y^(i) (w^T x^(i) + b) ≥ 1,  i = 1, . . . , m

We've now transformed the problem into a form that can be efficiently solved. The above is an optimization problem with a convex quadratic objective and only linear constraints. Its solution gives us the optimal margin classifier. This optimization problem can be solved using commercial quadratic programming (QP) code.^1

1 You may be familiar with linear programming, which solves optimization problems that have linear objectives and linear constraints. QP software is also widely available, which allows convex quadratic objectives and linear constraints.

While we could call the problem solved here, what we will instead do is make a digression to talk about Lagrange duality. This will lead us to our optimization problem's dual form, which will play a key role in allowing us to use kernels to get optimal margin classifiers to work efficiently in very high dimensional spaces. The dual form will also allow us to derive an efficient algorithm for solving the above optimization problem that will typically do much better than generic QP software.

5 Lagrange duality

Let's temporarily put aside SVMs and maximum margin classifiers, and talk about solving constrained optimization problems.
Consider a problem of the following form:

min_w  f(w)
s.t.  hi(w) = 0,  i = 1, . . . , l.

Some of you may recall how the method of Lagrange multipliers can be used to solve it. (Don't worry if you haven't seen it before.) In this method, we define the Lagrangian to be

L(w, β) = f(w) + sum_{i=1}^l βi hi(w)

Here, the βi's are called the Lagrange multipliers. We would then find and set L's partial derivatives to zero:

∂L/∂wi = 0;   ∂L/∂βi = 0,

and solve for w and β.
In this section, we will generalize this to constrained optimization problems in which we may have inequality as well as equality constraints.
Due to time constraints, we won't really be able to do the theory of Lagrange duality justice in this class,^2 but we will give the main ideas and results, which we will then apply to our optimal margin classifier's optimization problem.

2 Readers interested in learning more about this topic are encouraged to read, e.g., R. T. Rockafellar (1970), Convex Analysis, Princeton University Press.

Consider the following, which we'll call the primal optimization problem:

min_w  f(w)
s.t.  gi(w) ≤ 0,  i = 1, . . . , k
      hi(w) = 0,  i = 1, . . . , l.

To solve it, we start by defining the generalized Lagrangian

L(w, α, β) = f(w) + sum_{i=1}^k αi gi(w) + sum_{i=1}^l βi hi(w).

Here, the αi's and βi's are the Lagrange multipliers. Consider the quantity

θP(w) = max_{α,β : αi ≥ 0} L(w, α, β).

Here, the "P" subscript stands for "primal." Let some w be given. If w violates any of the primal constraints (i.e., if either gi(w) > 0 or hi(w) ≠ 0 for some i), then you should be able to verify that

θP(w) = max_{α,β : αi ≥ 0} [ f(w) + sum_{i=1}^k αi gi(w) + sum_{i=1}^l βi hi(w) ]   (1)
      = ∞.   (2)

Conversely, if the constraints are indeed satisfied for a particular value of w, then θP(w) = f(w). Hence,

θP(w) = { f(w)  if w satisfies the primal constraints
        { ∞     otherwise.

Thus, θP takes the same value as the objective in our problem for all values of w that satisfy the primal constraints, and is positive infinity if the constraints are violated. Hence, if we consider the minimization problem

min_w θP(w) = min_w max_{α,β : αi ≥ 0} L(w, α, β),

we see that it is the same problem (i.e., one that has the same solutions) as our original, primal problem. For later use, we also define the optimal value of the objective to be p* = min_w θP(w); we call this the value of the primal problem.
Now, let's look at a slightly different problem. We define

θD(α, β) = min_w L(w, α, β).

Here, the "D" subscript stands for "dual." Note also that whereas in the definition of θP we were optimizing (maximizing) with respect to α, β, here we are minimizing with respect to w.
We can now pose the dual optimization problem:

max_{α,β : αi ≥ 0} θD(α, β) = max_{α,β : αi ≥ 0} min_w L(w, α, β).

This is exactly the same as our primal problem shown above, except that the order of the "max" and the "min" are now exchanged. We also define the optimal value of the dual problem's objective to be d* = max_{α,β : αi ≥ 0} θD(α, β).
How are the primal and the dual problems related? It can easily be shown that

d* = max_{α,β : αi ≥ 0} min_w L(w, α, β) ≤ min_w max_{α,β : αi ≥ 0} L(w, α, β) = p*.

(You should convince yourself of this; it follows from the "max min" of a function always being less than or equal to the "min max.") However, under certain conditions, we will have

d* = p*,

so that we can solve the dual problem in lieu of the primal problem. Let's see what these conditions are.
Suppose f and the gi's are convex,^3 and the hi's are affine.^4 Suppose further that the constraints gi are (strictly) feasible; this means that there exists some w so that gi(w) < 0 for all i.

3 When f has a Hessian, then it is convex if and only if the Hessian is positive semi-definite. For instance, f(w) = w^T w is convex; similarly, all linear (and affine) functions are also convex. (A function f can also be convex without being differentiable, but we won't need those more general definitions of convexity here.)
4 I.e., there exist ai, bi, so that hi(w) = ai^T w + bi.
"Affine" means the same thing as linear, except that we also allow the extra intercept term bi.

Under our above assumptions, there must exist w*, α*, β* so that w* is the solution to the primal problem, α*, β* are the solution to the dual problem, and moreover p* = d* = L(w*, α*, β*). Moreover, w*, α* and β* satisfy the Karush-Kuhn-Tucker (KKT) conditions, which are as follows:

∂/∂wi L(w*, α*, β*) = 0,  i = 1, . . . , n   (3)
∂/∂βi L(w*, α*, β*) = 0,  i = 1, . . . , l   (4)
αi* gi(w*) = 0,  i = 1, . . . , k            (5)
gi(w*) ≤ 0,  i = 1, . . . , k                (6)
αi* ≥ 0,  i = 1, . . . , k                   (7)

Moreover, if some w*, α*, β* satisfy the KKT conditions, then they are also a solution to the primal and dual problems.
We draw attention to Equation (5), which is called the KKT dual complementarity condition. Specifically, it implies that if αi* > 0, then gi(w*) = 0. (I.e., the "gi(w) ≤ 0" constraint is active, meaning it holds with equality rather than with inequality.) Later on, this will be key for showing that the SVM has only a small number of "support vectors"; the KKT dual complementarity condition will also give us our convergence test when we talk about the SMO algorithm.

6 Optimal margin classifiers

Previously, we posed the following (primal) optimization problem for finding the optimal margin classifier:

min_{w,b}  (1/2) ||w||^2
s.t.  y^(i) (w^T x^(i) + b) ≥ 1,  i = 1, . . . , m

We can write the constraints as

gi(w) = −y^(i) (w^T x^(i) + b) + 1 ≤ 0.

We have one such constraint for each training example. Note that from the KKT dual complementarity condition, we will have αi > 0 only for the training examples that have functional margin exactly equal to one (i.e., the ones corresponding to constraints that hold with equality, gi(w) = 0). Consider the figure below, in which a maximum margin separating hyperplane is shown by the solid line.

[Figure: a maximum margin separating hyperplane (solid line), with one negative and two positive examples lying on the dashed margin lines parallel to it.]

The points with the smallest margins are exactly the ones closest to the decision boundary; here, these are the three points (one negative and two positive examples) that lie on the dashed lines parallel to the decision boundary. Thus, only three of the αi's—namely, the ones corresponding to these three training examples—will be non-zero at the optimal solution to our optimization problem. These three points are called the support vectors in this problem. The fact that the number of support vectors can be much smaller than the size of the training set will be useful later.
Let's move on. Looking ahead, as we develop the dual form of the problem, one key idea to watch out for is that we'll try to write our algorithm in terms of only the inner product <x^(i), x^(j)> (think of this as (x^(i))^T x^(j)) between points in the input feature space. The fact that we can express our algorithm in terms of these inner products will be key when we apply the kernel trick.
When we construct the Lagrangian for our optimization problem we have:

L(w, b, α) = (1/2) ||w||^2 − sum_{i=1}^m αi [ y^(i) (w^T x^(i) + b) − 1 ].   (8)

Note that there are only "αi" but no "βi" Lagrange multipliers, since the problem has only inequality constraints.
Let's find the dual form of the problem. To do so, we need to first minimize L(w, b, α) with respect to w and b (for fixed α), to get θD, which we'll do by setting the derivatives of L with respect to w and b to zero. We have:

∇w L(w, b, α) = w − sum_{i=1}^m αi y^(i) x^(i) = 0

This implies that
w = sum_{i=1}^m αi y^(i) x^(i).   (9)

As for the derivative with respect to b, we obtain

∂/∂b L(w, b, α) = sum_{i=1}^m αi y^(i) = 0.   (10)

If we take the definition of w in Equation (9) and plug that back into the Lagrangian (Equation 8), and simplify, we get

L(w, b, α) = sum_{i=1}^m αi − (1/2) sum_{i,j=1}^m y^(i) y^(j) αi αj (x^(i))^T x^(j) − b sum_{i=1}^m αi y^(i).

But from Equation (10), the last term must be zero, so we obtain

L(w, b, α) = sum_{i=1}^m αi − (1/2) sum_{i,j=1}^m y^(i) y^(j) αi αj (x^(i))^T x^(j).

Recall that we got to the equation above by minimizing L with respect to w and b. Putting this together with the constraints αi ≥ 0 (that we always had) and the constraint (10), we obtain the following dual optimization problem:

max_α  W(α) = sum_{i=1}^m αi − (1/2) sum_{i,j=1}^m y^(i) y^(j) αi αj <x^(i), x^(j)>
s.t.  αi ≥ 0,  i = 1, . . . , m
      sum_{i=1}^m αi y^(i) = 0.

You should also be able to verify that the conditions required for p* = d* and the KKT conditions (Equations 3–7) to hold are indeed satisfied in our optimization problem. Hence, we can solve the dual in lieu of solving the primal problem. Specifically, in the dual problem above, we have a maximization problem in which the parameters are the αi's. We'll talk later about the specific algorithm that we're going to use to solve the dual problem, but if we are indeed able to solve it (i.e., find the α's that maximize W(α) subject to the constraints), then we can use Equation (9) to go back and find the optimal w's as a function of the α's. Having found w*, by considering the primal problem, it is also straightforward to find the optimal value for the intercept term b as

b* = − ( max_{i:y^(i)=−1} w*^T x^(i) + min_{i:y^(i)=1} w*^T x^(i) ) / 2.   (11)

(Check for yourself that this is correct.)
Before moving on, let's also take a more careful look at Equation (9), which gives the optimal value of w in terms of (the optimal value of) α. Suppose we've fit our model's parameters to a training set, and now wish to make a prediction at a new input point x. We would then calculate w^T x + b, and predict y = 1 if and only if this quantity is bigger than zero. But using (9), this quantity can also be written:

w^T x + b = ( sum_{i=1}^m αi y^(i) x^(i) )^T x + b   (12)
          = sum_{i=1}^m αi y^(i) <x^(i), x> + b.   (13)

Hence, if we've found the αi's, in order to make a prediction, we have to calculate a quantity that depends only on the inner product between x and the points in the training set. Moreover, we saw earlier that the αi's will all be zero except for the support vectors. Thus, many of the terms in the sum above will be zero, and we really need to find only the inner products between x and the support vectors (of which there is often only a small number) in order to calculate (13) and make our prediction.
By examining the dual form of the optimization problem, we gained significant insight into the structure of the problem, and were also able to write the entire algorithm in terms of only inner products between input feature vectors. In the next section, we will exploit this property to apply kernels to our classification problem. The resulting algorithm, support vector machines, will be able to efficiently learn in very high dimensional spaces.
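As a small illustration (ours, not from the notes), here is how a prediction via Equation (13) might look in Octave, assuming hypothetical variables alphas and b holding an already-computed solution to the dual, X (m x n) and y (m x 1) holding the training set, and xnew (n x 1) the new input:

sv = find(alphas > 1e-8);        % indices of the support vectors
f  = sum(alphas(sv) .* y(sv) .* (X(sv, :) * xnew)) + b;  % Equation (13)
prediction = sign(f);            % predict y in {-1, +1}

Note that only the support vectors enter the sum, exactly as discussed above.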
+ +7 + +Kernels + +Back in our discussion of linear regression, we had a problem in which the +input x was the living area of a house, and we considered performing regres- + + 14 +sion using the features x, x2 and x3 (say) to obtain a cubic function. To +distinguish between these two sets of variables, we’ll call the “original” input +value the input attributes of a problem (in this case, x, the living area). +When that is mapped to some new set of quantities that are then passed to +the learning algorithm, we’ll call those new quantities the input features. +(Unfortunately, different authors use different terms to describe these two +things, but we’ll try to use this terminology consistently in these notes.) We +will also let φ denote the feature mapping, which maps from the attributes +to the features. For instance, in our example, we had + + +x +φ(x) =  x2  . +x3 + +Rather than applying SVMs using the original input attributes x, we may +instead want to learn using some features φ(x). To do so, we simply need to +go over our previous algorithm, and replace x everywhere in it with φ(x). +Since the algorithm can be written entirely in terms of the inner products x, z , this means that we would replace all those inner products with +φ(x), φ(z) . Specificically, given a feature mapping φ, we define the corresponding Kernel to be +K(x, z) = φ(x)T φ(z). + +Then, everywhere we previously had x, z in our algorithm, we could simply +replace it with K(x, z), and our algorithm would now be learning using the +features φ. +Now, given φ, we could easily compute K(x, z) by finding φ(x) and φ(z) +and taking their inner product. But what’s more interesting is that often, +K(x, z) may be very inexpensive to calculate, even though φ(x) itself may +be very expensive to calculate (perhaps because it is an extremely high dimensional vector). In such settings, by using in our algorithm an efficient +way to calculate K(x, z), we can get SVMs to learn in the high dimensional +feature space space given by φ, but without ever having to explicitly find or +represent vectors φ(x). +Lets see an example. Suppose x, z ∈ Rn , and consider +K(x, z) = (xT z)2 . + + 15 +We can also write this as +n + +n + +xi z i + +K(x, z) = + +xi z i + +i=1 +n + +j=1 +n + +xi xj z i z j + += +i=1 j=1 +n + +(xi xj )(zi zj ) + += +i,j=1 + +Thus, we see that K(x, z) = φ(x)T φ(z), where the feature mapping φ is given +(shown here for the case of n = 3) by + + +x1 x1 + x1 x2  + + + x1 x3  + + + x2 x1  + + +. +x +x +φ(x) =  +2 +2 + + + x2 x3  + + + x3 x1  + + + x3 x2  +x3 x3 + +Note that whereas calculating the high-dimensional φ(x) requires O(n2 ) time, +finding K(x, z) takes only O(n) time—linear in the dimension of the input +attributes. +For a related kernel, also consider +K(x, z) = (xT z + c)2 +n + += + +n + +(xi xj )(zi zj ) + +i,j=1 + +√ +√ +( 2cxi )( 2czi ) + c2 . + +i=1 + +(Check this yourself.) This corresponds to the feature mapping (again shown + + 16 +for n = 3) + + + + + + + + + + + + +φ(x) =  + + + + + + + + + + + +x1 x1 +x1 x2 +x1 x3 +x2 x1 +x2 x2 +x2 x3 +x3 x1 +x3 x2 +x +√ 3 x3 +√2cx1 +√2cx2 +2cx3 +c + + + + + + + + + + + + + +, + + + + + + + + + + + +and the parameter c controls the relative weighting between the xi (first +order) and the xi xj (second order) terms. +More broadly, the kernel K(x, z) = (xT z + c)d corresponds to a feature +feature space, corresponding of all monomials of the +mapping to an n+d +d +form xi1 xi2 . . . 
that are up to order d. However, despite working in this O(n^d)-dimensional space, computing K(x, z) still takes only O(n) time, and hence we never need to explicitly represent feature vectors in this very high dimensional feature space.
Now, let's talk about a slightly different view of kernels. Intuitively (and there are things wrong with this intuition, but never mind), if φ(x) and φ(z) are close together, then we might expect K(x, z) = φ(x)^T φ(z) to be large. Conversely, if φ(x) and φ(z) are far apart—say nearly orthogonal to each other—then K(x, z) = φ(x)^T φ(z) will be small. So, we can think of K(x, z) as some measurement of how similar φ(x) and φ(z) are, or of how similar x and z are.
Given this intuition, suppose that for some learning problem that you're working on, you've come up with some function K(x, z) that you think might be a reasonable measure of how similar x and z are. For instance, perhaps you chose

K(x, z) = exp( −||x − z||^2 / (2σ^2) ).

This is a reasonable measure of x and z's similarity, and is close to 1 when x and z are close, and near 0 when x and z are far apart. Can we use this definition of K as the kernel in an SVM? In this particular example, the answer is yes. (This kernel is called the Gaussian kernel, and corresponds to an infinite dimensional feature mapping φ.) But more broadly, given some function K, how can we tell if it's a valid kernel; i.e., can we tell if there is some feature mapping φ so that K(x, z) = φ(x)^T φ(z) for all x, z?
Suppose for now that K is indeed a valid kernel corresponding to some feature mapping φ. Now, consider some finite set of m points (not necessarily the training set) {x^(1), . . . , x^(m)}, and let a square, m-by-m matrix K be defined so that its (i, j)-entry is given by Kij = K(x^(i), x^(j)). This matrix is called the Kernel matrix. Note that we've overloaded the notation and used K to denote both the kernel function K(x, z) and the kernel matrix K, due to their obvious close relationship.
Now, if K is a valid kernel, then Kij = K(x^(i), x^(j)) = φ(x^(i))^T φ(x^(j)) = φ(x^(j))^T φ(x^(i)) = K(x^(j), x^(i)) = Kji, and hence K must be symmetric. Moreover, letting φk(x) denote the k-th coordinate of the vector φ(x), we find that for any vector z, we have

z^T K z = sum_i sum_j zi Kij zj
        = sum_i sum_j zi φ(x^(i))^T φ(x^(j)) zj
        = sum_i sum_j zi sum_k φk(x^(i)) φk(x^(j)) zj
        = sum_k sum_i sum_j zi φk(x^(i)) φk(x^(j)) zj
        = sum_k ( sum_i zi φk(x^(i)) )^2
        ≥ 0.

The second-to-last step above used the same trick as you saw in Problem set 1 Q1. Since z was arbitrary, this shows that K is positive semi-definite (K ≥ 0).
Hence, we've shown that if K is a valid kernel (i.e., if it corresponds to some feature mapping φ), then the corresponding kernel matrix K ∈ R^{m×m} is symmetric positive semidefinite. More generally, this turns out to be not only a necessary, but also a sufficient, condition for K to be a valid kernel (also called a Mercer kernel). The following result is due to Mercer.^5

5 Many texts present Mercer's theorem in a slightly more complicated form involving L2 functions, but when the input attributes take values in R^n, the version given here is equivalent.

Theorem (Mercer). Let K : R^n × R^n → R be given. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {x^(1), . . . , x^(m)}, (m < ∞), the corresponding kernel matrix is symmetric positive semi-definite.
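As an aside (our own sketch, not part of the notes), Mercer's condition is easy to probe numerically for a given kernel on a given point set. Here is an Octave sketch for the Gaussian kernel, relying on Octave's automatic broadcasting; X is m x n and sigma is our hypothetical bandwidth parameter:

sigma = 1.0;
sqd = sum(X.^2, 2) + sum(X.^2, 2)' - 2 * (X * X');   % ||x_i - x_j||^2
K = exp(-sqd / (2 * sigma^2));                       % Gaussian kernel matrix
isSymmetric = norm(K - K', 'fro') < 1e-12;
minEig = min(eig((K + K') / 2));   % should be >= 0 (up to roundoff)

A negative minEig (beyond roundoff) on some point set would certify that a candidate K is not a valid kernel; the Gaussian kernel always passes.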
Given a function K, apart from trying to find a feature mapping φ that corresponds to it, this theorem therefore gives another way of testing if it is a valid kernel. You'll also have a chance to play with these ideas more in problem set 2.
In class, we also briefly talked about a couple of other examples of kernels. For instance, consider the digit recognition problem, in which given an image (16x16 pixels) of a handwritten digit (0-9), we have to figure out which digit it was. Using either a simple polynomial kernel K(x, z) = (x^T z)^d or the Gaussian kernel, SVMs were able to obtain extremely good performance on this problem. This was particularly surprising since the input attributes x were just a 256-dimensional vector of the image pixel intensity values, and the system had no prior knowledge about vision, or even about which pixels are adjacent to which other ones. Another example that we briefly talked about in lecture was that if the objects x that we are trying to classify are strings (say, x is a list of amino acids, which strung together form a protein), then it seems hard to construct a reasonable, "small" set of features for most learning algorithms, especially if different strings have different lengths. However, consider letting φ(x) be a feature vector that counts the number of occurrences of each length-k substring in x. If we're considering strings of English letters, then there are 26^k such strings. Hence, φ(x) is a 26^k-dimensional vector; even for moderate values of k, this is probably too big for us to efficiently work with. (E.g., 26^4 ≈ 460000.) However, using (dynamic programming-ish) string matching algorithms, it is possible to efficiently compute K(x, z) = φ(x)^T φ(z), so that we can now implicitly work in this 26^k-dimensional feature space, but without ever explicitly computing feature vectors in this space.
The application of kernels to support vector machines should already be clear and so we won't dwell too much longer on it here. Keep in mind however that the idea of kernels has significantly broader applicability than SVMs. Specifically, if you have any learning algorithm that you can write in terms of only inner products <x, z> between input attribute vectors, then by replacing this with K(x, z) where K is a kernel, you can "magically" allow your algorithm to work efficiently in the high dimensional feature space corresponding to K. For instance, this kernel trick can be applied with the perceptron to derive a kernel perceptron algorithm. Many of the algorithms that we'll see later in this class will also be amenable to this method, which has come to be known as the "kernel trick."
8 Regularization and the non-separable case

The derivation of the SVM as presented so far assumed that the data is linearly separable. While mapping data to a high dimensional feature space via φ does generally increase the likelihood that the data is separable, we can't guarantee that it always will be so. Also, in some cases it is not clear that finding a separating hyperplane is exactly what we'd want to do, since that might be susceptible to outliers. For instance, the left figure below shows an optimal margin classifier, and when a single outlier is added in the upper-left region (right figure), it causes the decision boundary to make a dramatic swing, and the resulting classifier has a much smaller margin.

[Figure: left, an optimal margin classifier on a separable dataset; right, the same data with a single outlier added in the upper-left region, which swings the decision boundary and shrinks the margin.]

To make the algorithm work for non-linearly separable datasets as well as be less sensitive to outliers, we reformulate our optimization (using ℓ1 regularization) as follows:

min_{w,b,ξ}  (1/2) ||w||^2 + C sum_{i=1}^m ξi
s.t.  y^(i) (w^T x^(i) + b) ≥ 1 − ξi,  i = 1, . . . , m
      ξi ≥ 0,  i = 1, . . . , m.

Thus, examples are now permitted to have (functional) margin less than 1, and if an example has functional margin 1 − ξi (with ξi > 0), we would pay a cost of the objective function being increased by C ξi. The parameter C controls the relative weighting between the twin goals of making ||w||^2 small (which we saw earlier makes the margin large) and of ensuring that most examples have functional margin at least 1.
As before, we can form the Lagrangian:

L(w, b, ξ, α, r) = (1/2) w^T w + C sum_{i=1}^m ξi − sum_{i=1}^m αi [ y^(i) (w^T x^(i) + b) − 1 + ξi ] − sum_{i=1}^m ri ξi.

Here, the αi's and ri's are our Lagrange multipliers (constrained to be ≥ 0). We won't go through the derivation of the dual again in detail, but after setting the derivatives with respect to w and b to zero as before, substituting them back in, and simplifying, we obtain the following dual form of the problem:

max_α  W(α) = sum_{i=1}^m αi − (1/2) sum_{i,j=1}^m y^(i) y^(j) αi αj <x^(i), x^(j)>
s.t.  0 ≤ αi ≤ C,  i = 1, . . . , m
      sum_{i=1}^m αi y^(i) = 0.

As before, we also have that w can be expressed in terms of the αi's as given in Equation (9), so that after solving the dual problem, we can continue to use Equation (13) to make our predictions. Note that, somewhat surprisingly, in adding ℓ1 regularization, the only change to the dual problem is that what was originally a constraint that 0 ≤ αi has now become 0 ≤ αi ≤ C. The calculation for b* also has to be modified (Equation 11 is no longer valid); see the comments in the next section/Platt's paper.
Also, the KKT dual-complementarity conditions (which in the next section will be useful for testing for the convergence of the SMO algorithm) are:

αi = 0      ⇒  y^(i) (w^T x^(i) + b) ≥ 1   (14)
αi = C      ⇒  y^(i) (w^T x^(i) + b) ≤ 1   (15)
0 < αi < C  ⇒  y^(i) (w^T x^(i) + b) = 1.  (16)

Now, all that remains is to give an algorithm for actually solving the dual problem, which we will do in the next section.

9 The SMO algorithm

The SMO (sequential minimal optimization) algorithm, due to John Platt, gives an efficient way of solving the dual problem arising from the derivation of the SVM. Partly to motivate the SMO algorithm, and partly because it's interesting in its own right, let's first take another digression to talk about the coordinate ascent algorithm.

9.1 Coordinate ascent

Consider trying to solve the unconstrained optimization problem

max_α  W(α1, α2, . . . , αm).

Here, we think of W as just some function of the parameters αi's, and for now ignore any relationship between this problem and SVMs. We've already seen two optimization algorithms, gradient ascent and Newton's method. The new algorithm we're going to consider here is called coordinate ascent:

Loop until convergence: {
  For i = 1, . . . , m, {
    αi := arg max_{α̂i} W(α1, . . . , αi−1, α̂i, αi+1, . . . , αm).
  }
}

Thus, in the innermost loop of this algorithm, we will hold all the variables except for some αi fixed, and reoptimize W with respect to just the parameter αi. In the version of this method presented here, the inner loop reoptimizes the variables in order α1, α2, . . . , αm, α1, α2, . . ..
(A more sophisticated version might choose other orderings; for instance, we may choose the next variable to update according to which one we expect to allow us to make the largest increase in W(α).)
When the function W happens to be of such a form that the "arg max" in the inner loop can be performed efficiently, then coordinate ascent can be a fairly efficient algorithm. Here's a picture of coordinate ascent in action:

[Figure: the contours of a quadratic function, with the path taken by coordinate ascent from the initialization (2, −2) to the global maximum; each step is parallel to one of the axes.]

The ellipses in the figure are the contours of a quadratic function that we want to optimize. Coordinate ascent was initialized at (2, −2), and also plotted in the figure is the path that it took on its way to the global maximum. Notice that on each step, coordinate ascent takes a step that's parallel to one of the axes, since only one variable is being optimized at a time.
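As a concrete illustration (ours, not from the notes), here is coordinate ascent in Octave on a small quadratic we chose to resemble the figure, W(a1, a2) = -a1^2 - 2*a2^2 + 2*a1*a2 + 2*a2. Each inner step maximizes W exactly over one coordinate by setting the corresponding partial derivative to zero:

a = [2; -2];                   % initialization, as in the figure
for iter = 1:20
  a(1) = a(2);                 % argmax over a1: dW/da1 = -2*a1 + 2*a2 = 0
  a(2) = (a(1) + 1) / 2;       % argmax over a2: dW/da2 = -4*a2 + 2*a1 + 2 = 0
end
% a converges to the global maximum of this W at (1, 1), with each update
% moving parallel to one axis, exactly as in the picture above.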
[Figure: the box [0, C] × [0, C] in the (α_1, α_2) plane, the line α_1 y^(1) + α_2 y^(2) = ζ, and the resulting lower bound L and upper bound H on α_2.]

From the constraints (18), we know that α_1 and α_2 must lie within the box [0, C] × [0, C] shown. Also plotted is the line α_1 y^(1) + α_2 y^(2) = ζ, on which we know α_1 and α_2 must lie. Note also that, from these constraints, we know L ≤ α_2 ≤ H; otherwise, (α_1, α_2) can't simultaneously satisfy both the box and the straight line constraint. In this example, L = 0. But depending on what the line α_1 y^(1) + α_2 y^(2) = ζ looks like, this won't always necessarily be the case; more generally, there will be some lower-bound L and some upper-bound H on the permissible values for α_2 that will ensure that α_1, α_2 lie within the box [0, C] × [0, C].

Using Equation (20), we can also write α_1 as a function of α_2:

α_1 = (ζ − α_2 y^(2)) y^(1).

(Check this derivation yourself; we again used the fact that y^(1) ∈ {−1, 1} so that (y^(1))² = 1.) Hence, the objective W(α) can be written

W(α_1, α_2, . . . , α_m) = W((ζ − α_2 y^(2)) y^(1), α_2, . . . , α_m).

Treating α_3, . . . , α_m as constants, you should be able to verify that this is just some quadratic function in α_2. I.e., this can also be expressed in the form aα_2² + bα_2 + c for some appropriate a, b, and c. If we ignore the "box" constraints (18) (or, equivalently, that L ≤ α_2 ≤ H), then we can easily maximize this quadratic function by setting its derivative to zero and solving. We'll let α_2^{new,unclipped} denote the resulting value of α_2. You should also be able to convince yourself that if we had instead wanted to maximize W with respect to α_2 but subject to the box constraint, then we can find the resulting optimal value simply by taking α_2^{new,unclipped} and "clipping" it to lie in the [L, H] interval, to get

α_2^{new} = H                      if α_2^{new,unclipped} > H
            α_2^{new,unclipped}    if L ≤ α_2^{new,unclipped} ≤ H
            L                      if α_2^{new,unclipped} < L.

Finally, having found the α_2^{new}, we can use Equation (20) to go back and find the optimal value of α_1^{new}.

There are a couple more details that are quite easy but that we'll leave you to read about yourself in Platt's paper: one is the choice of the heuristics used to select the next α_i, α_j to update; the other is how to update b as the SMO algorithm is run.
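As a concrete illustration of the clipping step, here is a minimal Python sketch, assuming the coefficients a, b of the quadratic in α_2 and the bounds L, H have already been computed (all names are illustrative; this is not Platt's full update).

import numpy as np

def update_alpha2(a, b, L, H):
    # Unconstrained maximizer of a*x^2 + b*x + c (with a < 0 so that the
    # quadratic has a maximum):
    unclipped = -b / (2.0 * a)
    # "Clip" it back into the feasible interval [L, H]:
    return float(np.clip(unclipped, L, H))

print(update_alpha2(a=-2.0, b=3.0, L=0.0, H=0.5))   # 0.75 clipped to 0.5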
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes4.txt b/Lectures/aimlcs229/cs229-notes4.txt
new file mode 100644
index 0000000..335f0c0
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes4.txt
@@ -0,0 +1,635 @@

CS229 Lecture notes
Andrew Ng

Part VI
Learning Theory

1  Bias/variance tradeoff

When talking about linear regression, we discussed the problem of whether to fit a "simple" model such as the linear "y = θ_0 + θ_1 x," or a more "complex" model such as the polynomial "y = θ_0 + θ_1 x + · · · + θ_5 x⁵." We saw the following example:

[Figure: three fits to the same dataset of y versus x: a linear fit (left), a quadratic fit (middle), and a 5th order polynomial fit (right).]

Fitting a 5th order polynomial to the data (rightmost figure) did not result in a good model. Specifically, even though the 5th order polynomial did a very good job predicting y (say, prices of houses) from x (say, living area) for the examples in the training set, we do not expect the model shown to be a good one for predicting the prices of houses not in the training set. In other words, what has been learned from the training set does not generalize well to other houses. The generalization error (which will be made formal shortly) of a hypothesis is its expected error on examples not necessarily in the training set.

Both the models in the leftmost and the rightmost figures above have large generalization error. However, the problems that the two models suffer from are very different. If the relationship between y and x is not linear, then even if we were fitting a linear model to a very large amount of training data, the linear model would still fail to accurately capture the structure in the data. Informally, we define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set. Thus, for the problem above, the linear model suffers from large bias, and may underfit (i.e., fail to capture structure exhibited by) the data.

Apart from bias, there's a second component to the generalization error, consisting of the variance of a model fitting procedure. Specifically, when fitting a 5th order polynomial as in the rightmost figure, there is a large risk that we're fitting patterns in the data that happened to be present in our small, finite training set, but that do not reflect the wider pattern of the relationship between x and y. This could be, say, because in the training set we just happened by chance to get a slightly more-expensive-than-average house here, and a slightly less-expensive-than-average house there, and so on. By fitting these "spurious" patterns in the training set, we might again obtain a model with large generalization error. In this case, we say the model has large variance.[1]

[1] In these notes, we will not try to formalize the definitions of bias and variance beyond this discussion. While bias and variance are straightforward to define formally for, e.g., linear regression, there have been several proposals for the definitions of bias and variance for classification, and there is as yet no agreement on what is the "right" and/or the most useful formalism.

Often, there is a tradeoff between bias and variance. If our model is too "simple" and has very few parameters, then it may have large bias (but small variance); if it is too "complex" and has very many parameters, then it may suffer from large variance (but have smaller bias). In the example above, fitting a quadratic function does better than either of the extremes of a first or a fifth order polynomial.

2  Preliminaries

In this set of notes, we begin our foray into learning theory. Apart from being interesting and enlightening in its own right, this discussion will also help us hone our intuitions and derive rules of thumb about how to best apply learning algorithms in different settings. We will also seek to answer a few questions: First, can we make formal the bias/variance tradeoff that was just discussed? This will also eventually lead us to talk about model selection methods, which can, for instance, automatically decide what order polynomial to fit to a training set. Second, in machine learning it's really generalization error that we care about, but most learning algorithms fit their models to the training set.
+ + 3 +generalization error that we care about, but most learning algorithms fit their +models to the training set. Why should doing well on the training set tell us +anything about generalization error? Specifically, can we relate error on the +training set to generalization error? Third and finally, are there conditions +under which we can actually prove that learning algorithms will work well? +We start with two simple but very useful lemmas. +Lemma. (The union bound). Let A1 , A2 , . . . , Ak be k different events (that +may not be independent). Then +P (A1 ∪ · · · ∪ Ak ) ≤ P (A1 ) + . . . + P (Ak ). +In probability theory, the union bound is usually stated as an axiom +(and thus we won’t try to prove it), but it also makes intuitive sense: The +probability of any one of k events happening is at most the sums of the +probabilities of the k different events. +Lemma. (Hoeffding inequality) Let Z1 , . . . , Zm be m independent and identically distributed (iid) random variables drawn from a Bernoulli(φ) distribution. I.e., P (Zi = 1) = φ, and P (Zi = 0) = 1 − φ. Let φˆ = (1/m) m +i=1 Zi +be the mean of these random variables, and let any γ > 0 be fixed. Then +ˆ > γ) ≤ 2 exp(−2γ 2 m) +P (|φ − φ| +This lemma (which in learning theory is also called the Chernoff bound) +ˆ +says that if we take φ—the +average of m Bernoulli(φ) random variables—to +be our estimate of φ, then the probability of our being far from the true value +is small, so long as m is large. Another way of saying this is that if you have +a biased coin whose chance of landing on heads is φ, then if you toss it m +times and calculate the fraction of times that it came up heads, that will be +a good estimate of φ with high probability (if m is large). +Using just these two lemmas, we will be able to prove some of the deepest +and most important results in learning theory. +To simplify our exposition, lets restrict our attention to binary classification in which the labels are y ∈ {0, 1}. Everything we’ll say here generalizes +to other, including regression and multi-class classification, problems. +We assume we are given a training set S = {(x(i) , y (i) ); i = 1, . . . , m} +of size m, where the training examples (x(i) , y (i) ) are drawn iid from some +probability distribution D. For a hypothesis h, we define the training error +(also called the empirical risk or empirical error in learning theory) to +be +m +1 +1{h(x(i) ) = y (i) }. +εˆ(h) = +m i=1 + + 4 +This is just the fraction of training examples that h misclassifies. When we +want to make explicit the dependence of εˆ(h) on the training set S, we may +also write this a εˆS (h). We also define the generalization error to be +ε(h) = P(x,y)∼D (h(x) = y). +I.e. this is the probability that, if we now draw a new example (x, y) from +the distribution D, h will misclassify it. +Note that we have assumed that the training data was drawn from the +same distribution D with which we’re going to evaluate our hypotheses (in +the definition of generalization error). This is sometimes also referred to as +one of the PAC assumptions.2 +Consider the setting of linear classification, and let hθ (x) = 1{θ T x ≥ 0}. +What’s a reasonable way of fitting the parameters θ? One approach is to try +to minimize the training error, and pick +θˆ = arg min εˆ(hθ ). +θ + +We call this process empirical risk minimization (ERM), and the resulting +ˆ = h ˆ. 
Using just these two lemmas, we will be able to prove some of the deepest and most important results in learning theory.

To simplify our exposition, let's restrict our attention to binary classification in which the labels are y ∈ {0, 1}. Everything we'll say here generalizes to other problems, including regression and multi-class classification.

We assume we are given a training set S = {(x^(i), y^(i)); i = 1, . . . , m} of size m, where the training examples (x^(i), y^(i)) are drawn iid from some probability distribution D. For a hypothesis h, we define the training error (also called the empirical risk or empirical error in learning theory) to be

ε̂(h) = (1/m) Σ_{i=1}^m 1{h(x^(i)) ≠ y^(i)}.

This is just the fraction of training examples that h misclassifies. When we want to make explicit the dependence of ε̂(h) on the training set S, we may also write this as ε̂_S(h). We also define the generalization error to be

ε(h) = P_{(x,y)∼D}(h(x) ≠ y).

I.e., this is the probability that, if we now draw a new example (x, y) from the distribution D, h will misclassify it.

Note that we have assumed that the training data was drawn from the same distribution D with which we're going to evaluate our hypotheses (in the definition of generalization error). This is sometimes also referred to as one of the PAC assumptions.[2]

[2] PAC stands for "probably approximately correct," which is a framework and set of assumptions under which numerous results on learning theory were proved. Of these, the assumption of training and testing on the same distribution, and the assumption of the independently drawn training examples, were the most important.

Consider the setting of linear classification, and let h_θ(x) = 1{θᵀx ≥ 0}. What's a reasonable way of fitting the parameters θ? One approach is to try to minimize the training error, and pick

θ̂ = arg min_θ ε̂(h_θ).

We call this process empirical risk minimization (ERM), and the resulting hypothesis output by the learning algorithm is ĥ = h_θ̂. We think of ERM as the most "basic" learning algorithm, and it will be this algorithm that we focus on in these notes. (Algorithms such as logistic regression can also be viewed as approximations to empirical risk minimization.)

In our study of learning theory, it will be useful to abstract away from the specific parameterization of hypotheses and from issues such as whether we're using a linear classifier. We define the hypothesis class H used by a learning algorithm to be the set of all classifiers considered by it. For linear classification, H = {h_θ : h_θ(x) = 1{θᵀx ≥ 0}, θ ∈ R^{n+1}} is thus the set of all classifiers over X (the domain of the inputs) where the decision boundary is linear. More broadly, if we were studying, say, neural networks, then we could let H be the set of all classifiers representable by some neural network architecture.

Empirical risk minimization can now be thought of as a minimization over the class of functions H, in which the learning algorithm picks the hypothesis:

ĥ = arg min_{h∈H} ε̂(h)

3  The case of finite H

Let's start by considering a learning problem in which we have a finite hypothesis class H = {h_1, . . . , h_k} consisting of k hypotheses. Thus, H is just a set of k functions mapping from X to {0, 1}, and empirical risk minimization selects ĥ to be whichever of these k functions has the smallest training error.

We would like to give guarantees on the generalization error of ĥ. Our strategy for doing so will be in two parts: first, we will show that ε̂(h) is a reliable estimate of ε(h) for all h. Second, we will show that this implies an upper-bound on the generalization error of ĥ.

Take any one, fixed, h_i ∈ H. Consider a Bernoulli random variable Z whose distribution is defined as follows. We're going to sample (x, y) ∼ D. Then, we set Z = 1{h_i(x) ≠ y}. I.e., we're going to draw one example, and let Z indicate whether h_i misclassifies it. Similarly, we also define Z_j = 1{h_i(x^(j)) ≠ y^(j)}. Since our training set was drawn iid from D, Z and the Z_j's have the same distribution.

We see that the misclassification probability on a randomly drawn example—that is, ε(h_i)—is exactly the expected value of Z (and Z_j). Moreover, the training error can be written

ε̂(h_i) = (1/m) Σ_{j=1}^m Z_j.

Thus, ε̂(h_i) is exactly the mean of the m random variables Z_j that are drawn iid from a Bernoulli distribution with mean ε(h_i). Hence, we can apply the Hoeffding inequality, and obtain

P(|ε(h_i) − ε̂(h_i)| > γ) ≤ 2 exp(−2γ²m).

This shows that, for our particular h_i, training error will be close to generalization error with high probability, assuming m is large. But we don't just want to guarantee that ε(h_i) will be close to ε̂(h_i) (with high probability) for just one particular h_i. We want to prove that this will be true simultaneously for all h ∈ H. To do so, let A_i denote the event that |ε(h_i) − ε̂(h_i)| > γ. We've already shown that, for any particular A_i, it holds true that P(A_i) ≤ 2 exp(−2γ²m).
Thus, using the union bound, we have that

P(∃h ∈ H. |ε(h_i) − ε̂(h_i)| > γ) = P(A_1 ∪ · · · ∪ A_k)
  ≤ Σ_{i=1}^k P(A_i)
  ≤ Σ_{i=1}^k 2 exp(−2γ²m)
  = 2k exp(−2γ²m).

If we subtract both sides from 1, we find that

P(¬∃h ∈ H. |ε(h_i) − ε̂(h_i)| > γ) = P(∀h ∈ H. |ε(h_i) − ε̂(h_i)| ≤ γ)
  ≥ 1 − 2k exp(−2γ²m).

(The "¬" symbol means "not.") So, with probability at least 1 − 2k exp(−2γ²m), we have that ε(h) will be within γ of ε̂(h) for all h ∈ H. This is called a uniform convergence result, because this is a bound that holds simultaneously for all (as opposed to just one) h ∈ H.

In the discussion above, what we did was, for particular values of m and γ, give a bound on the probability that, for some h ∈ H, |ε(h) − ε̂(h)| > γ. There are three quantities of interest here: m, γ, and the probability of error; we can bound any one of them in terms of the other two.

For instance, we can ask the following question: given γ and some δ > 0, how large must m be before we can guarantee that with probability at least 1 − δ, training error will be within γ of generalization error? By setting δ = 2k exp(−2γ²m) and solving for m [you should convince yourself this is the right thing to do!], we find that if

m ≥ (1/(2γ²)) log(2k/δ),

then with probability at least 1 − δ, we have that |ε(h) − ε̂(h)| ≤ γ for all h ∈ H. (Equivalently, this shows that the probability that |ε(h) − ε̂(h)| > γ for some h ∈ H is at most δ.) This bound tells us how many training examples we need in order to make a guarantee. The training set size m that a certain method or algorithm requires in order to achieve a certain level of performance is also called the algorithm's sample complexity.

The key property of the bound above is that the number of training examples needed to make this guarantee is only logarithmic in k, the number of hypotheses in H. This will be important later.
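For a feel for the numbers, here is a tiny Python sketch (the choices of γ and δ are illustrative) evaluating the bound m ≥ (1/(2γ²)) log(2k/δ) for a few values of k; note how slowly m grows as k explodes.

import math

def sample_complexity(k, gamma, delta):
    # Finite-H bound: m >= (1/(2*gamma^2)) * log(2k/delta).
    return (1.0 / (2.0 * gamma**2)) * math.log(2.0 * k / delta)

for k in [10, 10**4, 10**8]:
    print(k, math.ceil(sample_complexity(k, gamma=0.05, delta=0.05)))
# m grows only logarithmically in k: roughly 1199, 2580, and 4422
# examples for the three values of k above.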
Similarly, we can also hold m and δ fixed and solve for γ in the previous equation, and show [again, convince yourself that this is right!] that with probability 1 − δ, we have that for all h ∈ H,

|ε̂(h) − ε(h)| ≤ √((1/(2m)) log(2k/δ)).

Now, let's assume that uniform convergence holds, i.e., that |ε(h) − ε̂(h)| ≤ γ for all h ∈ H. What can we prove about the generalization of our learning algorithm that picked ĥ = arg min_{h∈H} ε̂(h)?

Define h* = arg min_{h∈H} ε(h) to be the best possible hypothesis in H. Note that h* is the best that we could possibly do given that we are using H, so it makes sense to compare our performance to that of h*. We have:

ε(ĥ) ≤ ε̂(ĥ) + γ
     ≤ ε̂(h*) + γ
     ≤ ε(h*) + 2γ.

The first line used the fact that |ε(ĥ) − ε̂(ĥ)| ≤ γ (by our uniform convergence assumption). The second used the fact that ĥ was chosen to minimize ε̂(h), and hence ε̂(ĥ) ≤ ε̂(h) for all h, and in particular ε̂(ĥ) ≤ ε̂(h*). The third line used the uniform convergence assumption again, to show that ε̂(h*) ≤ ε(h*) + γ. So, what we've shown is the following: if uniform convergence occurs, then the generalization error of ĥ is at most 2γ worse than the best possible hypothesis in H!

Let's put all this together into a theorem.

Theorem. Let |H| = k, and let any m, δ be fixed. Then with probability at least 1 − δ, we have that

ε(ĥ) ≤ (min_{h∈H} ε(h)) + 2 √((1/(2m)) log(2k/δ)).

This is proved by letting γ equal the √· term, using our previous argument that uniform convergence occurs with probability at least 1 − δ, and then noting that uniform convergence implies that ε(ĥ) is at most 2γ higher than ε(h*) = min_{h∈H} ε(h) (as we showed previously).

This also quantifies what we were saying previously about the bias/variance tradeoff in model selection. Specifically, suppose we have some hypothesis class H, and are considering switching to some much larger hypothesis class H' ⊇ H. If we switch to H', then the first term min_h ε(h) can only decrease (since we'd then be taking a min over a larger set of functions). Hence, by learning using a larger hypothesis class, our "bias" can only decrease. However, if k increases, then the second 2√· term would also increase. This increase corresponds to our "variance" increasing when we use a larger hypothesis class.

By holding γ and δ fixed and solving for m like we did before, we can also obtain the following sample complexity bound:

Corollary. Let |H| = k, and let any δ, γ be fixed. Then for ε(ĥ) ≤ min_{h∈H} ε(h) + 2γ to hold with probability at least 1 − δ, it suffices that

m ≥ (1/(2γ²)) log(2k/δ) = O((1/γ²) log(k/δ)).

4  The case of infinite H

We have proved some useful theorems for the case of finite hypothesis classes. But many hypothesis classes, including any parameterized by real numbers (as in linear classification), actually contain an infinite number of functions. Can we prove similar results for this setting?

Let's start by going through something that is not the "right" argument. Better and more general arguments exist, but this will be useful for honing our intuitions about the domain.

Suppose we have an H that is parameterized by d real numbers. Since we are using a computer to represent real numbers, and IEEE double-precision floating point (double's in C) uses 64 bits to represent a floating point number, this means that our learning algorithm, assuming we're using double-precision floating point, is parameterized by 64d bits. Thus, our hypothesis class really consists of at most k = 2^{64d} different hypotheses. From the Corollary at the end of the previous section, we therefore find that, to guarantee that ε(ĥ) ≤ ε(h*) + 2γ holds with probability at least 1 − δ, it suffices that

m ≥ O((1/γ²) log(2^{64d}/δ)) = O((d/γ²) log(1/δ)) = O_{γ,δ}(d).

(The γ, δ subscripts are to indicate that the last big-O is hiding constants that may depend on γ and δ.) Thus, the number of training examples needed is at most linear in the number of parameters of the model.

The fact that we relied on 64-bit floating point makes this argument not entirely satisfying, but the conclusion is nonetheless roughly correct: if what we're going to do is try to minimize training error, then in order to learn "well" using a hypothesis class that has d parameters, generally we're going to need on the order of a linear number of training examples in d.

(At this point, it's worth noting that these results were proved for an algorithm that uses empirical risk minimization. Thus, while the linear dependence of sample complexity on d does generally hold for most discriminative learning algorithms that try to minimize training error or some approximation to training error, these conclusions do not always apply as readily to non-ERM learning algorithms. Giving good theoretical guarantees on many non-ERM learning algorithms is still an area of active research.)
The other part of our previous argument that's slightly unsatisfying is that it relies on the parameterization of H. Intuitively, this doesn't seem like it should matter: we had written the class of linear classifiers as h_θ(x) = 1{θ_0 + θ_1 x_1 + · · · + θ_n x_n ≥ 0}, with n + 1 parameters θ_0, . . . , θ_n. But it could also be written h_{u,v}(x) = 1{(u_0² − v_0²) + (u_1² − v_1²) x_1 + · · · + (u_n² − v_n²) x_n ≥ 0} with 2n + 2 parameters u_i, v_i. Yet, both of these are just defining the same H: the set of linear classifiers in n dimensions.

To derive a more satisfying argument, let's define a few more things. Given a set S = {x^(1), . . . , x^(d)} (no relation to the training set) of points x^(i) ∈ X, we say that H shatters S if H can realize any labeling on S. I.e., if for any set of labels {y^(1), . . . , y^(d)}, there exists some h ∈ H so that h(x^(i)) = y^(i) for all i = 1, . . . , d.

Given a hypothesis class H, we then define its Vapnik-Chervonenkis dimension, written VC(H), to be the size of the largest set that is shattered by H. (If H can shatter arbitrarily large sets, then VC(H) = ∞.)

For instance, consider the following set of three points:

[Figure: three points x^(1), x^(2), x^(3) in general position in the (x_1, x_2) plane.]

Can the set H of linear classifiers in two dimensions (h(x) = 1{θ_0 + θ_1 x_1 + θ_2 x_2 ≥ 0}) shatter the set above? The answer is yes. Specifically, we see that, for any of the eight possible labelings of these points, we can find a linear classifier that obtains "zero training error" on them:

[Figure: the eight possible labelings of the three points, each shown with a linear decision boundary that classifies it perfectly.]

Moreover, it is possible to show that there is no set of 4 points that this hypothesis class can shatter. Thus, the largest set that H can shatter is of size 3, and hence VC(H) = 3.

Note that the VC dimension of H here is 3 even though there may be sets of size 3 that it cannot shatter. For instance, if we had a set of three points lying in a straight line (left figure), then there is no way to find a linear separator for the labeling of the three points shown below (right figure):

[Figure: left, three collinear points; right, a labeling of those three points that no linear classifier can realize.]

In other words, under the definition of the VC dimension, in order to prove that VC(H) is at least d, we need to show only that there's at least one set of size d that H can shatter.

The following theorem, due to Vapnik, can then be shown. (This is, many would argue, the most important theorem in all of learning theory.)

Theorem. Let H be given, and let d = VC(H). Then with probability at least 1 − δ, we have that for all h ∈ H,

|ε(h) − ε̂(h)| ≤ O(√((d/m) log(m/d) + (1/m) log(1/δ))).

Thus, with probability at least 1 − δ, we also have that:

ε(ĥ) ≤ ε(h*) + O(√((d/m) log(m/d) + (1/m) log(1/δ))).

In other words, if a hypothesis class has finite VC dimension, then uniform convergence occurs as m becomes large. As before, this allows us to give a bound on ε(h) in terms of ε(h*). We also have the following corollary:

Corollary. For |ε(h) − ε̂(h)| ≤ γ to hold for all h ∈ H (and hence ε(ĥ) ≤ ε(h*) + 2γ) with probability at least 1 − δ, it suffices that m = O_{γ,δ}(d).

In other words, the number of training examples needed to learn "well" using H is linear in the VC dimension of H. It turns out that, for "most" hypothesis classes, the VC dimension (assuming a "reasonable" parameterization) is also roughly linear in the number of parameters. Putting these together, we conclude that (for an algorithm that tries to minimize training error) the number of training examples needed is usually roughly linear in the number of parameters of H.
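As a concrete check of the shattering definition above, the following Python sketch (a brute-force search over an illustrative grid of parameter vectors, not part of the original notes) verifies that linear classifiers in two dimensions realize all 2³ labelings of three points in general position.

import itertools
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # general position
X = np.hstack([np.ones((3, 1)), points])     # prepend the intercept term

grid = np.linspace(-3, 3, 13)                # coarse grid over (theta0, theta1, theta2)
thetas = np.array(list(itertools.product(grid, repeat=3)))
preds = (thetas @ X.T >= 0).astype(int)      # each row: the labels h assigns to all 3 points

realized = {tuple(row) for row in preds}
print(len(realized) == 8)                    # True: all 8 labelings are realized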
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes5.txt b/Lectures/aimlcs229/cs229-notes5.txt
new file mode 100644
index 0000000..0ca43da
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes5.txt
@@ -0,0 +1,315 @@

CS229 Lecture notes
Andrew Ng

Part VI
Regularization and model selection

Suppose we are trying to select among several different models for a learning problem. For instance, we might be using a polynomial regression model h_θ(x) = g(θ_0 + θ_1 x + θ_2 x² + · · · + θ_k x^k), and wish to decide if k should be 0, 1, . . . , or 10. How can we automatically select a model that represents a good tradeoff between the twin evils of bias and variance?[1] Alternatively, suppose we want to automatically choose the bandwidth parameter τ for locally weighted regression, or the parameter C for our ℓ1-regularized SVM. How can we do that?

[1] Given that we said in the previous set of notes that bias and variance are two very different beasts, some readers may be wondering if we should be calling them "twin" evils here. Perhaps it'd be better to think of them as non-identical twins. The phrase "the fraternal twin evils of bias and variance" doesn't have the same ring to it, though.

For the sake of concreteness, in these notes we assume we have some finite set of models M = {M_1, . . . , M_d} that we're trying to select among. For instance, in our first example above, the model M_i would be an i-th order polynomial regression model. (The generalization to infinite M is not hard.[2]) Alternatively, if we are trying to decide between using an SVM, a neural network or logistic regression, then M may contain these models.

[2] If we are trying to choose from an infinite set of models, say corresponding to the possible values of the bandwidth τ ∈ R⁺, we may discretize τ and consider only a finite number of possible values for it. More generally, most of the algorithms described here can all be viewed as performing optimization search in the space of models, and we can perform this search over infinite model classes as well.

1  Cross validation

Let's suppose we are, as usual, given a training set S. Given what we know about empirical risk minimization, here's what might initially seem like an algorithm, resulting from using empirical risk minimization for model selection:

1. Train each model M_i on S, to get some hypothesis h_i.
2. Pick the hypothesis with the smallest training error.

This algorithm does not work. Consider choosing the order of a polynomial. The higher the order of the polynomial, the better it will fit the training set S, and thus the lower the training error. Hence, this method will always select a high-variance, high-degree polynomial model, which we saw previously is often a poor choice.

Here's an algorithm that works better. In hold-out cross validation (also called simple cross validation), we do the following:

1. Randomly split S into S_train (say, 70% of the data) and S_cv (the remaining 30%). Here, S_cv is called the hold-out cross validation set.
2. Train each model M_i on S_train only, to get some hypothesis h_i.
3. Select and output the hypothesis h_i that had the smallest error ε̂_{S_cv}(h_i) on the hold out cross validation set. (Recall, ε̂_{S_cv}(h) denotes the empirical error of h on the set of examples in S_cv.)
By testing on a set of examples S_cv that the models were not trained on, we obtain a better estimate of each hypothesis h_i's true generalization error, and can then pick the one with the smallest estimated generalization error. Usually, somewhere between 1/4 and 1/3 of the data is used in the hold-out cross validation set, and 30% is a typical choice.

Optionally, step 3 in the algorithm may also be replaced with selecting the model M_i according to arg min_i ε̂_{S_cv}(h_i), and then retraining M_i on the entire training set S. (This is often a good idea, with one exception being learning algorithms that are very sensitive to perturbations of the initial conditions and/or data. For these methods, M_i doing well on S_train does not necessarily mean it will also do well on S_cv, and it might be better to forgo this retraining step.)

The disadvantage of using hold-out cross validation is that it "wastes" about 30% of the data. Even if we were to take the optional step of retraining the model on the entire training set, it's still as if we're trying to find a good model for a learning problem in which we had 0.7m training examples, rather than m training examples, since we're testing models that were trained on only 0.7m examples each time. While this is fine if data is abundant and/or cheap, in learning problems in which data is scarce (consider a problem with m = 20, say), we'd like to do something better.

Here is a method, called k-fold cross validation, that holds out less data each time:

1. Randomly split S into k disjoint subsets of m/k training examples each. Let's call these subsets S_1, . . . , S_k.
2. For each model M_i, we evaluate it as follows:
   For j = 1, . . . , k:
   Train the model M_i on S_1 ∪ · · · ∪ S_{j−1} ∪ S_{j+1} ∪ · · · ∪ S_k (i.e., train on all the data except S_j) to get some hypothesis h_{ij}.
   Test the hypothesis h_{ij} on S_j, to get ε̂_{S_j}(h_{ij}).
   The estimated generalization error of model M_i is then calculated as the average of the ε̂_{S_j}(h_{ij})'s (averaged over j).
3. Pick the model M_i with the lowest estimated generalization error, and retrain that model on the entire training set S. The resulting hypothesis is then output as our final answer.

A typical choice for the number of folds to use here would be k = 10. While the fraction of data held out each time is now 1/k—much smaller than before—this procedure may also be more computationally expensive than hold-out cross validation, since we now need to train each model k times. (A short sketch of the procedure appears below.)

While k = 10 is a commonly used choice, in problems in which data is really scarce, sometimes we will use the extreme choice of k = m in order to leave out as little data as possible each time. In this setting, we would repeatedly train on all but one of the training examples in S, and test on that held-out example. The resulting m = k errors are then averaged together to obtain our estimate of the generalization error of a model. This method has its own name; since we're holding out one training example at a time, this method is called leave-one-out cross validation.
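Here is a minimal Python sketch of k-fold cross validation as just described, assuming a hypothetical model object with fit(X, y) and predict(X) methods; any learning algorithm with this interface would do.

import numpy as np

def k_fold_cv_error(model, X, y, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))   # random split of S
    folds = np.array_split(idx, k)                          # S_1, ..., S_k
    errors = []
    for j in range(k):
        test = folds[j]
        train = np.concatenate([folds[l] for l in range(k) if l != j])
        model.fit(X[train], y[train])                       # train on all data except S_j
        errors.append(np.mean(model.predict(X[test]) != y[test]))
    return float(np.mean(errors))    # estimated generalization error of the model

Setting k = len(y) in this routine gives leave-one-out cross validation.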
Finally, even though we have described the different versions of cross validation as methods for selecting a model, they can also be used more simply to evaluate a single model or algorithm. For example, if you have implemented some learning algorithm and want to estimate how well it performs for your application (or if you have invented a novel learning algorithm and want to report in a technical paper how well it performs on various test sets), cross validation would give a reasonable way of doing so.

2  Feature Selection

One special and important case of model selection is called feature selection. To motivate this, imagine that you have a supervised learning problem where the number of features n is very large (perhaps n ≫ m), but you suspect that there is only a small number of features that are "relevant" to the learning task. Even if you use a simple linear classifier (such as the perceptron) over the n input features, the VC dimension of your hypothesis class would still be O(n), and thus overfitting would be a potential problem unless the training set is fairly large.

In such a setting, you can apply a feature selection algorithm to reduce the number of features. Given n features, there are 2ⁿ possible feature subsets (since each of the n features can either be included or excluded from the subset), and thus feature selection can be posed as a model selection problem over 2ⁿ possible models. For large values of n, it's usually too expensive to explicitly enumerate over and compare all 2ⁿ models, and so typically some heuristic search procedure is used to find a good feature subset. The following search procedure is called forward search:

1. Initialize F = ∅.
2. Repeat {
   (a) For i = 1, . . . , n: if i ∉ F, let F_i = F ∪ {i}, and use some version of cross validation to evaluate features F_i. (I.e., train your learning algorithm using only the features in F_i, and estimate its generalization error.)
   (b) Set F to be the best feature subset found on step (a).
   }
3. Select and output the best feature subset that was evaluated during the entire search procedure.

The outer loop of the algorithm can be terminated either when F = {1, . . . , n} is the set of all features, or when |F| exceeds some pre-set threshold (corresponding to the maximum number of features that you want the algorithm to consider using); a sketch of the procedure is given below.
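Here is a minimal Python sketch of forward search, assuming a hypothetical helper cv_error(feature_subset) that trains the learner using only those features and returns its estimated generalization error (e.g., built on the k-fold routine sketched earlier).

import numpy as np

def forward_search(n, cv_error, max_features=None):
    F = set()
    best_subset, best_err = set(), float("inf")
    while len(F) < (max_features or n):
        # Evaluate every single-feature addition to the current subset F.
        candidates = [F | {i} for i in range(n) if i not in F]
        errs = [cv_error(Fi) for Fi in candidates]
        F = candidates[int(np.argmin(errs))]        # keep the best addition
        if min(errs) < best_err:
            best_err, best_subset = min(errs), set(F)
    return best_subset   # best subset evaluated during the entire search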
+One possible choice of the score would be define S(i) to be (the absolute +value of) the correlation between xi and y, as measured on the training data. +This would result in our choosing the features that are the most strongly +correlated with the class labels. In practice, it is more common (particularly +for discrete-valued features xi ) to choose S(i) to be the mutual information +MI(xi , y) between xi and y: +MI(xi , y) = + +p(xi , y) log +xi ∈{0,1} y∈{0,1} + +p(xi , y) +. +p(xi )p(y) + +(The equation above assumes that xi and y are binary-valued; more generally +the summations would be over the domains of the variables.) The probabilities above p(xi , y), p(xi ) and p(y) can all be estimated according to their +empirical distributions on the training set. +To gain intuition about what this score does, note that the mutual information can also be expressed as a Kullback-Leibler (KL) divergence: +MI(xi , y) = KL (p(xi , y)||p(xi )p(y)) +You’ll get to play more with KL-divergence in Problem set #3, but informally, this gives a measure of how different the probability distributions + + 6 +p(xi , y) and p(xi )p(y) are. If xi and y are independent random variables, +then we would have p(xi , y) = p(xi )p(y), and the KL-divergence between the +two distributions will be zero. This is consistent with the idea if xi and y +are independent, then xi is clearly very “non-informative” about y, and thus +the score S(i) should be small. Conversely, if xi is very “informative” about +y, then their mutual information MI(xi , y) would be large. +One final detail: Now that you’ve ranked the features according to their +scores S(i), how do you decide how many features k to choose? Well, one +standard way to do so is to use cross validation to select among the possible +values of k. For example, when applying naive Bayes to text classification— +a problem where n, the vocabulary size, is usually very large—using this +method to select a feature subset often results in increased classifier accuracy. + +3 + +Bayesian statistics and regularization + +In this section, we will talk about one more tool in our arsenal for our battle +against overfitting. +At the beginning of the quarter, we talked about parameter fitting using +maximum likelihood (ML), and chose our parameters according to +m + +p(y (i) |x(i) ; θ). + +θML = arg max +θ + +i=1 + +Throughout our subsequent discussions, we viewed θ as an unknown parameter of the world. This view of the θ as being constant-valued but unknown +is taken in frequentist statistics. In the frequentist this view of the world, θ +is not random—it just happens to be unknown—and it’s our job to come up +with statistical procedures (such as maximum likelihood) to try to estimate +this parameter. +An alternative way to approach our parameter estimation problems is to +take the Bayesian view of the world, and think of θ as being a random +variable whose value is unknown. In this approach, we would specify a +prior distribution p(θ) on θ that expresses our “prior beliefs” about the +parameters. Given a training set S = {(x(i) , y (i) )}m +i=1 , when we are asked to +make a prediction on a new value of x, we can then compute the posterior + + 7 +distribution on the parameters +p(S|θ)p(θ) +p(S) +m +(i) (i) +i=1 p(y |x , θ) p(θ) += +(i) (i) +( m +i=1 p(y |x , θ)p(θ)) dθ +θ + +p(θ|S) = + +(1) + +In the equation above, p(y (i) |x(i) , θ) comes from whatever model you’re using +for your learning problem. 
For example, if you are using Bayesian logistic re(i) +(i) +gression, then you might choose p(y (i) |x(i) , θ) = hθ (x(i) )y (1−hθ (x(i) ))(1−y ) , +where hθ (x(i) ) = 1/(1 + exp(−θ T x(i) )).3 +When we are given a new test example x and asked to make it prediction +on it, we can compute our posterior distribution on the class label using the +posterior distribution on θ: +p(y|x, S) = + +p(y|x, θ)p(θ|S)dθ + +(2) + +θ + +In the equation above, p(θ|S) comes from Equation (1). Thus, for example, +if the goal is to the predict the expected value of y given x, then we would +output4 +E[y|x, S] = + +yp(y|x, S)dy +y + +The procedure that we’ve outlined here can be thought of as doing “fully +Bayesian” prediction, where our prediction is computed by taking an average +with respect to the posterior p(θ|S) over θ. Unfortunately, in general it is +computationally very difficult to compute this posterior distribution. This is +because it requires taking integrals over the (usually high-dimensional) θ as +in Equation (1), and this typically cannot be done in closed-form. +Thus, in practice we will instead approximate the posterior distribution +for θ. One common approximation is to replace our posterior distribution for +θ (as in Equation 2) with a single point estimate. The MAP (maximum +a posteriori) estimate for θ is given by +m + +p(y (i) |x(i) , θ)p(θ). + +θMAP = arg max +θ + +3 + +(3) + +i=1 + +Since we are now viewing θ as a random variable, it is okay to condition on it value, +and write “p(y|x, θ)” instead of “p(y|x; θ).” +4 +The integral below would be replaced by a summation if y is discrete-valued. + + 8 +Note that this is the same formulas as for the ML (maximum likelihood) +estimate for θ, except for the prior p(θ) term at the end. +In practical applications, a common choice for the prior p(θ) is to assume +that θ ∼ N (0, τ 2 I). Using this choice of prior, the fitted parameters θMAP +will have smaller norm than that selected by maximum likelihood. (See +Problem Set #3.) In practice, this causes the Bayesian MAP estimate to be +less susceptible to overfitting than the ML estimate of the parameters. For +example, Bayesian logistic regression turns out to be an effective algorithm for +text classification, even though in text classification we usually have n +m. + + \ No newline at end of file diff --git a/Lectures/aimlcs229/cs229-notes6.txt b/Lectures/aimlcs229/cs229-notes6.txt new file mode 100644 index 0000000..31085b1 --- /dev/null +++ b/Lectures/aimlcs229/cs229-notes6.txt @@ -0,0 +1,114 @@ +CS229 Lecture notes +Andrew Ng + +1 + +The perceptron and large margin classifiers + +In this final set of notes on learning theory, we will introduce a different +model of machine learning. Specifically, we have so far been considering +batch learning settings in which we are first given a training set to learn +with, and our hypothesis h is then evaluated on separate test data. In this set +of notes, we will consider the online learning setting in which the algorithm +has to make predictions continuously even while it’s learning. +In this setting, the learning algorithm is given a sequence of examples +(x(1) , y (1) ), (x(2) , y (2) ), . . . (x(m) , y (m) ) in order. Specifically, the algorithm first +sees x(1) and is asked to predict what it thinks y (1) is. After making its prediction, the true value of y (1) is revealed to the algorithm (and the algorithm +may use this information to perform some learning). 
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes6.txt b/Lectures/aimlcs229/cs229-notes6.txt
new file mode 100644
index 0000000..31085b1
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes6.txt
@@ -0,0 +1,114 @@

CS229 Lecture notes
Andrew Ng

1  The perceptron and large margin classifiers

In this final set of notes on learning theory, we will introduce a different model of machine learning. Specifically, we have so far been considering batch learning settings in which we are first given a training set to learn with, and our hypothesis h is then evaluated on separate test data. In this set of notes, we will consider the online learning setting in which the algorithm has to make predictions continuously even while it's learning.

In this setting, the learning algorithm is given a sequence of examples (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m)) in order. Specifically, the algorithm first sees x^(1) and is asked to predict what it thinks y^(1) is. After making its prediction, the true value of y^(1) is revealed to the algorithm (and the algorithm may use this information to perform some learning). The algorithm is then shown x^(2) and again asked to make a prediction, after which y^(2) is revealed, and it may again perform some more learning. This proceeds until we reach (x^(m), y^(m)). In the online learning setting, we are interested in the total number of errors made by the algorithm during this process. Thus, it models applications in which the algorithm has to make predictions even while it's still learning.

We will give a bound on the online learning error of the perceptron algorithm. To make our subsequent derivations easier, we will use the notational convention of denoting the class labels by y ∈ {−1, 1}.

Recall that the perceptron algorithm has parameters θ ∈ R^{n+1}, and makes its predictions according to

h_θ(x) = g(θᵀx)    (1)

where g(z) = 1 if z ≥ 0, and g(z) = −1 if z < 0.

Also, given a training example (x, y), the perceptron learning rule updates the parameters as follows. If h_θ(x) = y, then it makes no change to the parameters. Otherwise, it performs the update[1]

θ := θ + yx.

[1] This looks slightly different from the update rule we had written down earlier in the quarter because here we have changed the labels to be y ∈ {−1, 1}. Also, the learning rate parameter α was dropped. The only effect of the learning rate is to scale all the parameters θ by some fixed constant, which does not affect the behavior of the perceptron.

The following theorem gives a bound on the online learning error of the perceptron algorithm, when it is run as an online algorithm that performs an update each time it gets an example wrong. Note that the bound below on the number of errors does not have an explicit dependence on the number of examples m in the sequence, or on the dimension n of the inputs (!).

Theorem (Block, 1962, and Novikoff, 1962). Let a sequence of examples (x^(1), y^(1)), (x^(2), y^(2)), . . . , (x^(m), y^(m)) be given. Suppose that ||x^(i)|| ≤ D for all i, and further that there exists a unit-length vector u (||u||₂ = 1) such that y^(i) · (uᵀx^(i)) ≥ γ for all examples in the sequence (i.e., uᵀx^(i) ≥ γ if y^(i) = 1, and uᵀx^(i) ≤ −γ if y^(i) = −1, so that u separates the data with a margin of at least γ). Then the total number of mistakes that the perceptron algorithm makes on this sequence is at most (D/γ)².

Proof. The perceptron updates its weights only on those examples on which it makes a mistake. Let θ^(k) be the weights that were being used when it made its k-th mistake. So, θ^(1) = 0 (since the weights are initialized to zero), and if the k-th mistake was on the example (x^(i), y^(i)), then g((x^(i))ᵀθ^(k)) ≠ y^(i), which implies that

(x^(i))ᵀθ^(k) y^(i) ≤ 0.    (2)

Also, from the perceptron learning rule, we would have that θ^(k+1) = θ^(k) + y^(i)x^(i).

We then have

(θ^(k+1))ᵀu = (θ^(k))ᵀu + y^(i)(x^(i))ᵀu ≥ (θ^(k))ᵀu + γ.

By a straightforward inductive argument, this implies that

(θ^(k+1))ᵀu ≥ kγ.    (3)

Also, we have that

||θ^(k+1)||² = ||θ^(k) + y^(i)x^(i)||²
            = ||θ^(k)||² + ||x^(i)||² + 2y^(i)(x^(i))ᵀθ^(k)
            ≤ ||θ^(k)||² + ||x^(i)||²
            ≤ ||θ^(k)||² + D².    (4)

The third step above used Equation (2). Moreover, again by applying a straightforward inductive argument, we see that (4) implies

||θ^(k+1)||² ≤ kD².    (5)

Putting together (3) and (5) we find that

√k · D ≥ ||θ^(k+1)|| ≥ (θ^(k+1))ᵀu ≥ kγ.
The second inequality above follows from the fact that u is a unit-length vector (and zᵀu = ||z|| · ||u|| cos φ ≤ ||z|| · ||u||, where φ is the angle between z and u). Our result implies that k ≤ (D/γ)². Hence, if the perceptron made a k-th mistake, then k ≤ (D/γ)².
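Here is a minimal Python sketch of the perceptron run as an online algorithm, counting its mistakes on a synthetic, linearly separable stream (the data, and the omission of the intercept term, are illustrative simplifications).

import numpy as np

def perceptron_online(X, y):
    theta = np.zeros(X.shape[1])
    mistakes = 0
    for x_i, y_i in zip(X, y):                 # examples arrive in sequence
        pred = 1 if theta @ x_i >= 0 else -1   # predict before y_i is revealed
        if pred != y_i:                        # on a mistake, update theta
            theta = theta + y_i * x_i
            mistakes += 1
    return theta, mistakes

rng = np.random.default_rng(1)
u = np.array([0.6, 0.8])                       # unit vector defining the labels
X = rng.normal(size=(200, 2))
y = np.where(X @ u >= 0, 1, -1)
print(perceptron_online(X, y)[1])              # by the theorem, at most (D/γ)²
                                               # for this stream's D and margin γ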
+The distortion function J is a non-convex function, and so coordinate +descent on J is not guaranteed to converge to the global minimum. In other +words, k-means can be susceptible to local optima. Very often k-means will +work fine and come up with very good clusterings despite this. But if you +are worried about getting stuck in bad local minima, one common thing to +do is run k-means many times (using different random initial values for the +cluster centroids µj ). Then, out of all the different clusterings found, pick +the one that gives the lowest distortion J(c, µ). + + \ No newline at end of file diff --git a/Lectures/aimlcs229/cs229-notes7b.txt b/Lectures/aimlcs229/cs229-notes7b.txt new file mode 100644 index 0000000..d748cef --- /dev/null +++ b/Lectures/aimlcs229/cs229-notes7b.txt @@ -0,0 +1,187 @@ +CS229 Lecture notes +Andrew Ng + +Mixtures of Gaussians and the EM algorithm +In this set of notes, we discuss the EM (Expectation-Maximization) for density estimation. +Suppose that we are given a training set {x(1) , . . . , x(m) } as usual. Since +we are in the unsupervised learning setting, these points do not come with +any labels. +We wish to model the data by specifying a joint distribution p(x(i) , z (i) ) = +(i) (i) +p(x |z )p(z (i) ). Here, z (i) ∼ Multinomial(φ) (where φj ≥ 0, kj=1 φj = 1, +and the parameter φj gives p(z (i) = j),), and x(i) |z (i) = j ∼ N (µj , Σj ). We +let k denote the number of values that the z (i) ’s can take on. Thus, our +model posits that each x(i) was generated by randomly choosing z (i) from +{1, . . . , k}, and then x(i) was drawn from one of k Gaussians depeneding on +z (i) . This is called the mixture of Gaussians model. Also, note that the +z (i) ’s are latent random variables, meaning that they’re hidden/unobserved. +This is what will make our estimation problem difficult. +The parameters of our model are thus φ, φ and Σ. To estimate them, we +can write down the likelihood of our data: +m + +log p(x(i) ; φ, µ, Σ) + +ℓ(φ, µ, Σ) = +i=1 +m + += + +k + +p(x(i) |z (i) ; µ, Σ)p(z (i) ; φ). + +log +i=1 + +z (i) =1 + +However, if we set to zero the derivatives of this formula with respect to +the parameters and try to solve, we’ll find that it is not possible to find the +maximum likelihood estimates of the parameters in closed form. (Try this +yourself at home.) +The random variables z (i) indicate which of the k Gaussians each x(i) +had come from. Note that if we knew what the z (i) ’s were, the maximum +1 + + 2 +likelihood problem would have been easy. Specifically, we could then write +down the likelihood as +m + +log p(x(i) |z (i) ; µ, Σ) + log p(z (i) ; φ). + +ℓ(φ, µ, Σ) = +i=1 + +Maximizing this with respect to φ, µ and Σ gives the parameters: +φj = +µj = +Σj = + +1 +m + +m + +1{z (i) = j}, +i=1 +m +(i) += j}x(i) +i=1 1{z +, +m +(i) = j} +i=1 1{z +m +(i) += j}(x(i) − µj )(x(i) +i=1 1{z +m +(i) = j} +i=1 1{z + +− µj )T + +. + +Indeed, we see that if the z (i) ’s were known, then maximum likelihood +estimation becomes nearly identical to what we had when estimating the +parameters of the Gaussian discriminant analysis model, except that here +the z (i) ’s playing the role of the class labels.1 +However, in our density estimation problem, the z (i) ’s are not known. +What can we do? +The EM algorithm is an iterative algorithm that has two main steps. +Applied to our problem, in the E-step, it tries to “guess” the values of the +z (i) ’s. In the M-step, it updates the parameters of our model based on our +guesses. 
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes7b.txt b/Lectures/aimlcs229/cs229-notes7b.txt
new file mode 100644
index 0000000..d748cef
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes7b.txt
@@ -0,0 +1,187 @@

CS229 Lecture notes
Andrew Ng

Mixtures of Gaussians and the EM algorithm

In this set of notes, we discuss the EM (Expectation-Maximization) algorithm for density estimation.

Suppose that we are given a training set {x^(1), . . . , x^(m)} as usual. Since we are in the unsupervised learning setting, these points do not come with any labels.

We wish to model the data by specifying a joint distribution p(x^(i), z^(i)) = p(x^(i)|z^(i)) p(z^(i)). Here, z^(i) ∼ Multinomial(φ) (where φ_j ≥ 0, sum_{j=1}^k φ_j = 1, and the parameter φ_j gives p(z^(i) = j)), and x^(i)|z^(i) = j ∼ N(µ_j, Σ_j). We let k denote the number of values that the z^(i)'s can take on. Thus, our model posits that each x^(i) was generated by randomly choosing z^(i) from {1, . . . , k}, and then x^(i) was drawn from one of k Gaussians depending on z^(i). This is called the mixture of Gaussians model. Also, note that the z^(i)'s are latent random variables, meaning that they're hidden/unobserved. This is what will make our estimation problem difficult.

The parameters of our model are thus φ, µ and Σ. To estimate them, we can write down the likelihood of our data:

ℓ(φ, µ, Σ) = sum_{i=1}^m log p(x^(i); φ, µ, Σ)
           = sum_{i=1}^m log sum_{z^(i)=1}^k p(x^(i)|z^(i); µ, Σ) p(z^(i); φ).

However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we'll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.)

The random variables z^(i) indicate which of the k Gaussians each x^(i) had come from. Note that if we knew what the z^(i)'s were, the maximum likelihood problem would have been easy. Specifically, we could then write down the likelihood as

ℓ(φ, µ, Σ) = sum_{i=1}^m [log p(x^(i)|z^(i); µ, Σ) + log p(z^(i); φ)].

Maximizing this with respect to φ, µ and Σ gives the parameters:

φ_j = (1/m) sum_{i=1}^m 1{z^(i) = j},
µ_j = (sum_{i=1}^m 1{z^(i) = j} x^(i)) / (sum_{i=1}^m 1{z^(i) = j}),
Σ_j = (sum_{i=1}^m 1{z^(i) = j} (x^(i) − µ_j)(x^(i) − µ_j)ᵀ) / (sum_{i=1}^m 1{z^(i) = j}).

Indeed, we see that if the z^(i)'s were known, then maximum likelihood estimation becomes nearly identical to what we had when estimating the parameters of the Gaussian discriminant analysis model, except that here the z^(i)'s are playing the role of the class labels.[1]

[1] There are other minor differences in the formulas here from what we'd obtained in PS1 with Gaussian discriminant analysis, first because we've generalized the z^(i)'s to be multinomial rather than Bernoulli, and second because here we are using a different Σ_j for each Gaussian.

However, in our density estimation problem, the z^(i)'s are not known. What can we do?

The EM algorithm is an iterative algorithm that has two main steps. Applied to our problem, in the E-step, it tries to "guess" the values of the z^(i)'s. In the M-step, it updates the parameters of our model based on our guesses. Since in the M-step we are pretending that the guesses in the first part were correct, the maximization becomes easy. Here's the algorithm:

Repeat until convergence: {

(E-step) For each i, j, set

w_j^(i) := p(z^(i) = j | x^(i); φ, µ, Σ)

(M-step) Update the parameters:

φ_j := (1/m) sum_{i=1}^m w_j^(i),
µ_j := (sum_{i=1}^m w_j^(i) x^(i)) / (sum_{i=1}^m w_j^(i)),
Σ_j := (sum_{i=1}^m w_j^(i) (x^(i) − µ_j)(x^(i) − µ_j)ᵀ) / (sum_{i=1}^m w_j^(i)).

}

In the E-step, we calculate the posterior probability of the z^(i)'s, given x^(i) and using the current setting of our parameters. I.e., using Bayes rule, we obtain:

p(z^(i) = j | x^(i); φ, µ, Σ) = (p(x^(i)|z^(i) = j; µ, Σ) p(z^(i) = j; φ)) / (sum_{l=1}^k p(x^(i)|z^(i) = l; µ, Σ) p(z^(i) = l; φ)).

Here, p(x^(i)|z^(i) = j; µ, Σ) is given by evaluating the density of a Gaussian with mean µ_j and covariance Σ_j at x^(i); p(z^(i) = j; φ) is given by φ_j, and so on. The values w_j^(i) calculated in the E-step represent our "soft" guesses[2] for the values of z^(i).

[2] The term "soft" refers to our guesses being probabilities and taking values in [0, 1]; in contrast, a "hard" guess is one that represents a single best guess (such as taking values in {0, 1} or {1, . . . , k}).

Also, you should contrast the updates in the M-step with the formulas we had when the z^(i)'s were known exactly. They are identical, except that instead of the indicator functions "1{z^(i) = j}" indicating from which Gaussian each datapoint had come, we now instead have the w_j^(i)'s.

The EM-algorithm is also reminiscent of the K-means clustering algorithm, except that instead of the "hard" cluster assignments c^(i), we instead have the "soft" assignments w_j^(i). Similar to K-means, it is also susceptible to local optima, so reinitializing at several different initial parameters may be a good idea.
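Here is a minimal Python sketch of these E- and M-steps (the initialization, fixed iteration count, and absence of any covariance regularization are illustrative choices).

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    n = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = (2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(expo) / norm

def em_gmm(X, k, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    Sigma = np.array([np.eye(n) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma) by Bayes rule.
        w = np.column_stack([phi[j] * gaussian_pdf(X, mu[j], Sigma[j])
                             for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters using the soft assignments.
        for j in range(k):
            wj = w[:, j]
            phi[j] = wj.mean()
            mu[j] = wj @ X / wj.sum()
            diff = X - mu[j]
            Sigma[j] = (wj[:, None] * diff).T @ diff / wj.sum()
    return phi, mu, Sigma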
It's clear that the EM algorithm has a very natural interpretation of repeatedly trying to guess the unknown z^(i)'s; but how did it come about, and can we make any guarantees about it, such as regarding its convergence? In the next set of notes, we will describe a more general view of EM, one that will allow us to easily apply it to other estimation problems in which there are also latent variables, and which will allow us to give a convergence guarantee.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes8.txt b/Lectures/aimlcs229/cs229-notes8.txt
new file mode 100644
index 0000000..572dd99
--- /dev/null
+++ b/Lectures/aimlcs229/cs229-notes8.txt
@@ -0,0 +1,600 @@

CS229 Lecture notes
Andrew Ng

Part IX
The EM algorithm

In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians. In this set of notes, we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables. We begin our discussion with a very useful result called Jensen's inequality.

1  Jensen's inequality

Let f be a function whose domain is the set of real numbers. Recall that f is a convex function if f″(x) ≥ 0 (for all x ∈ R). In the case of f taking vector-valued inputs, this is generalized to the condition that its Hessian H is positive semi-definite (H ≥ 0). If f″(x) > 0 for all x, then we say f is strictly convex (in the vector-valued case, the corresponding statement is that H must be positive definite, written H > 0). Jensen's inequality can then be stated as follows:

Theorem. Let f be a convex function, and let X be a random variable. Then:

E[f(X)] ≥ f(EX).

Moreover, if f is strictly convex, then E[f(X)] = f(EX) holds true if and only if X = E[X] with probability 1 (i.e., if X is a constant).

Recall our convention of occasionally dropping the parentheses when writing expectations, so in the theorem above, f(EX) = f(E[X]).

For an interpretation of the theorem, consider the figure below.

[Figure: a convex function f (solid curve), with a and b marked on the x-axis, and f(a), f(b), f(EX) and E[f(X)] marked on the y-axis; E[f(X)] is the midpoint of f(a) and f(b), lying above f(EX).]

Here, f is a convex function shown by the solid line. Also, X is a random variable that has a 0.5 chance of taking the value a, and a 0.5 chance of taking the value b (indicated on the x-axis). Thus, the expected value of X is given by the midpoint between a and b.

We also see the values f(a), f(b) and f(E[X]) indicated on the y-axis. Moreover, the value E[f(X)] is now the midpoint on the y-axis between f(a) and f(b). From our example, we see that because f is convex, it must be the case that E[f(X)] ≥ f(EX).

Incidentally, quite a lot of people have trouble remembering which way the inequality goes, and remembering a picture like this is a good way to quickly figure out the answer.

Remark. Recall that f is [strictly] concave if and only if −f is [strictly] convex (i.e., f″(x) ≤ 0 or H ≤ 0). Jensen's inequality also holds for concave functions f, but with the direction of all the inequalities reversed (E[f(X)] ≤ f(EX), etc.).
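A quick numerical check of the inequality, for the convex function f(x) = x² and an X taking two values a and b with equal probability (all values below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.choice([1.0, 5.0], p=[0.5, 0.5], size=100000)   # X equals a or b

f = lambda x: x ** 2
print(np.mean(f(X)), f(np.mean(X)))   # approx 13 >= 9, as Jensen predicts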
2 The EM algorithm

Suppose we have an estimation problem in which we have a training set {x^{(1)}, . . . , x^{(m)}} consisting of m independent examples. We wish to fit the parameters of a model p(x, z) to the data, where the likelihood is given by

  \ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta).

But explicitly finding the maximum likelihood estimates of the parameters \theta may be hard. Here, the z^{(i)}'s are the latent random variables; and it is often the case that if the z^{(i)}'s were observed, then maximum likelihood estimation would be easy.

In such a setting, the EM algorithm gives an efficient method for maximum likelihood estimation. Maximizing \ell(\theta) explicitly might be difficult, and our strategy will be to instead repeatedly construct a lower-bound on \ell (E-step), and then optimize that lower-bound (M-step).

For each i, let Q_i be some distribution over the z's (\sum_z Q_i(z) = 1, Q_i(z) \geq 0). Consider the following:(1)

  \sum_i \log p(x^{(i)}; \theta) = \sum_i \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)                                      (1)
    = \sum_i \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}                                 (2)
    \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}                              (3)

The last step of this derivation used Jensen's inequality. Specifically, f(x) = log x is a concave function, since f''(x) = -1/x^2 < 0 over its domain x in R+. Also, the term

  \sum_{z^{(i)}} Q_i(z^{(i)}) \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]

in the summation is just an expectation of the quantity p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)}) with respect to z^{(i)} drawn according to the distribution given by Q_i. By Jensen's inequality, we have

  f\left( E_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right) \geq E_{z^{(i)} \sim Q_i}\left[ f\left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right],

where the "z^{(i)} \sim Q_i" subscripts above indicate that the expectations are with respect to z^{(i)} drawn from Q_i. This allowed us to go from Equation (2) to Equation (3).

Now, for any set of distributions Q_i, the formula (3) gives a lower-bound on \ell(\theta). There are many possible choices for the Q_i's. Which should we choose? Well, if we have some current guess \theta of the parameters, it seems natural to try to make the lower-bound tight at that value of \theta. I.e., we'll make the inequality above hold with equality at our particular value of \theta. (We'll see later how this enables us to prove that \ell(\theta) increases monotonically with successive iterations of EM.)

To make the bound tight for a particular value of \theta, we need the step involving Jensen's inequality in our derivation above to hold with equality. For this to be true, we know it is sufficient that the expectation be taken over a "constant"-valued random variable. I.e., we require that

  \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c

for some constant c that does not depend on z^{(i)}. This is easily accomplished by choosing

  Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta).

Actually, since we know \sum_z Q_i(z^{(i)}) = 1 (because it is a distribution), this further tells us that

  Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_z p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} | x^{(i)}; \theta)

Thus, we simply set the Q_i's to be the posterior distribution of the z^{(i)}'s given x^{(i)} and the setting of the parameters \theta.

Now, for this choice of the Q_i's, Equation (3) gives a lower-bound on the loglikelihood \ell that we're trying to maximize. This is the E-step. In the M-step of the algorithm, we then maximize our formula in Equation (3) with respect to the parameters to obtain a new setting of the \theta's. Repeatedly carrying out these two steps gives us the EM algorithm, which is as follows:

Repeat until convergence {

(E-step) For each i, set

  Q_i(z^{(i)}) := p(z^{(i)} | x^{(i)}; \theta).

(M-step) Set

  \theta := \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.

}

How do we know if this algorithm will converge? Well, suppose \theta^{(t)} and \theta^{(t+1)} are the parameters from two successive iterations of EM. We will now prove that \ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)}), which shows EM always monotonically improves the log-likelihood. The key to showing this result lies in our choice of the Q_i's. Specifically, on the iteration of EM in which the parameters had started out as \theta^{(t)}, we would have chosen Q_i^{(t)}(z^{(i)}) := p(z^{(i)} | x^{(i)}; \theta^{(t)}). We saw earlier that this choice ensures that Jensen's inequality, as applied to get Equation (3), holds with equality, and hence

  \ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}.

The parameters \theta^{(t+1)} are then obtained by maximizing the right hand side of the equation above.

(1) If z were continuous, then Q_i would be a density, and the summations over z in our discussion are replaced with integrals over z.
Thus,

  \ell(\theta^{(t+1)}) \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}      (4)
    \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}                           (5)
    = \ell(\theta^{(t)})                                                                                                                      (6)

This first inequality comes from the fact that

  \ell(\theta) \geq \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

holds for any values of Q_i and \theta, and in particular holds for Q_i = Q_i^{(t)}, \theta = \theta^{(t+1)}. To get Equation (5), we used the fact that \theta^{(t+1)} is chosen explicitly to be

  \arg\max_\theta \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},

and thus this formula evaluated at \theta^{(t+1)} must be equal to or larger than the same formula evaluated at \theta^{(t)}. Finally, the step used to get (6) was shown earlier, and follows from Q_i^{(t)} having been chosen to make Jensen's inequality hold with equality at \theta^{(t)}.

Hence, EM causes the likelihood to converge monotonically. In our description of the EM algorithm, we said we'd run it until convergence. Given the result that we just showed, one reasonable convergence test would be to check if the increase in \ell(\theta) between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving \ell(\theta) too slowly.

Remark. If we define

  J(Q, \theta) = \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})},

then we know \ell(\theta) \geq J(Q, \theta) from our previous derivation. EM can also be viewed as coordinate ascent on J, in which the E-step maximizes it with respect to Q (check this yourself), and the M-step maximizes it with respect to \theta.
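As an empirical companion to the monotonicity result just proved, here is a hedged Octave sketch that runs EM on a toy 1-D mixture of two Gaussians and checks that \ell(\theta) never decreases across iterations. The data and initial parameters are illustrative; normpdf is the univariate Gaussian density from the Octave statistics package.

  x = [randn(50,1) - 2; randn(50,1) + 2];          % toy 1-D data
  phi = [0.5 0.5]; mu = [-1 1]; s2 = [1 1];        % initial parameters
  for t = 1:20
    p = [phi(1)*normpdf(x, mu(1), sqrt(s2(1))), ...
         phi(2)*normpdf(x, mu(2), sqrt(s2(2)))];
    ll(t) = sum(log(sum(p, 2)));                   % log-likelihood l(theta(t))
    w = p ./ sum(p, 2);                            % E-step: posteriors w(i,j)
    for j = 1:2                                    % M-step updates
      mu(j)  = sum(w(:,j) .* x) / sum(w(:,j));
      s2(j)  = sum(w(:,j) .* (x - mu(j)).^2) / sum(w(:,j));
      phi(j) = mean(w(:,j));
    end
  end
  all(diff(ll) > -1e-10)          % ans = 1: l(theta) never decreases (up to rounding)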
3 Mixture of Gaussians revisited

Armed with our general definition of the EM algorithm, lets go back to our old example of fitting the parameters \phi, \mu and \Sigma in a mixture of Gaussians. For the sake of brevity, we carry out the derivations for the M-step updates only for \phi and \mu_j, and leave the updates for \Sigma_j as an exercise for the reader.

The E-step is easy. Following our algorithm derivation above, we simply calculate

  w_j^{(i)} = Q_i(z^{(i)} = j) = P(z^{(i)} = j | x^{(i)}; \phi, \mu, \Sigma).

Here, "Q_i(z^{(i)} = j)" denotes the probability of z^{(i)} taking the value j under the distribution Q_i.

Next, in the M-step, we need to maximize, with respect to our parameters \phi, \mu, \Sigma, the quantity

  \sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}
    = \sum_{i=1}^m \sum_{j=1}^k Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} | z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)}
    = \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \cdot \phi_j}{w_j^{(i)}}

Lets maximize this with respect to \mu_l. If we take the derivative with respect to \mu_l, we find

  \nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) \cdot \phi_j}{w_j^{(i)}}
    = -\nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \frac{1}{2} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)
    = \frac{1}{2} \sum_{i=1}^m w_l^{(i)} \nabla_{\mu_l} \left( 2 \mu_l^T \Sigma_l^{-1} x^{(i)} - \mu_l^T \Sigma_l^{-1} \mu_l \right)
    = \sum_{i=1}^m w_l^{(i)} \left( \Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l \right)

Setting this to zero and solving for \mu_l therefore yields the update rule

  \mu_l := \frac{\sum_{i=1}^m w_l^{(i)} x^{(i)}}{\sum_{i=1}^m w_l^{(i)}},

which was what we had in the previous set of notes.

Lets do one more example, and derive the M-step update for the parameters \phi_j. Grouping together only the terms that depend on \phi_j, we find that we need to maximize

  \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \phi_j.

However, there is an additional constraint that the \phi_j's sum to 1, since they represent the probabilities \phi_j = p(z^{(i)} = j; \phi). To deal with the constraint that \sum_{j=1}^k \phi_j = 1, we construct the Lagrangian

  L(\phi) = \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} \log \phi_j + \beta \left( \sum_{j=1}^k \phi_j - 1 \right),

where \beta is the Lagrange multiplier.(2) Taking derivatives, we find

  \frac{\partial}{\partial \phi_j} L(\phi) = \sum_{i=1}^m \frac{w_j^{(i)}}{\phi_j} + \beta

Setting this to zero and solving, we get

  \phi_j = \frac{\sum_{i=1}^m w_j^{(i)}}{-\beta}

I.e., \phi_j \propto \sum_{i=1}^m w_j^{(i)}. Using the constraint that \sum_j \phi_j = 1, we easily find that -\beta = \sum_{i=1}^m \sum_{j=1}^k w_j^{(i)} = \sum_{i=1}^m 1 = m. (This used the fact that w_j^{(i)} = Q_i(z^{(i)} = j), and since probabilities sum to 1, \sum_j w_j^{(i)} = 1.) We therefore have our M-step updates for the parameters \phi_j:

  \phi_j := \frac{1}{m} \sum_{i=1}^m w_j^{(i)}.

The derivation for the M-step updates to \Sigma_j is also entirely straightforward.

(2) We don't need to worry about the constraint that \phi_j \geq 0, because as we'll shortly see, the solution we'll find from this derivation will automatically satisfy that anyway.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-notes9.txt b/Lectures/aimlcs229/cs229-notes9.txt new file mode 100644 index 0000000..37541ae --- /dev/null +++ b/Lectures/aimlcs229/cs229-notes9.txt @@ -0,0 +1,601 @@

CS229 Lecture notes
Andrew Ng

Part X

Factor analysis

When we have data x^{(i)} in R^n that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting, we usually imagine problems where we have sufficient data to be able to discern the multiple-Gaussian structure in the data. For instance, this would be the case if our training set size m was significantly larger than the dimension n of the data.

Now, consider a setting in which n >> m. In such a problem, it might be difficult to model the data even with a single Gaussian, much less a mixture of Gaussians. Specifically, since the m data points span only a low-dimensional subspace of R^n, if we model the data as Gaussian, and estimate the mean and covariance using the usual maximum likelihood estimators,

  \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}

  \Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)(x^{(i)} - \mu)^T,

we would find that the matrix \Sigma is singular. This means that \Sigma^{-1} does not exist, and 1/|\Sigma|^{1/2} = 1/0. But both of these terms are needed in computing the usual density of a multivariate Gaussian distribution. Another way of stating this difficulty is that maximum likelihood estimates of the parameters result in a Gaussian that places all of its probability in the affine space spanned by the data,(1) and this corresponds to a singular covariance matrix.

More generally, unless m exceeds n by some reasonable amount, the maximum likelihood estimates of the mean and covariance may be quite poor. Nonetheless, we would still like to be able to fit a reasonable Gaussian model to the data, and perhaps capture some interesting covariance structure in the data. How can we do this?

In the next section, we begin by reviewing two possible restrictions on \Sigma, ones that allow us to fit \Sigma with small amounts of data but neither of which will give a satisfactory solution to our problem. We next discuss some properties of Gaussians that will be needed later; specifically, how to find marginal and conditional distributions of Gaussians. Finally, we present the factor analysis model, and EM for it.

(1) This is the set of points x satisfying x = \sum_{i=1}^m \alpha_i x^{(i)}, for some \alpha_i's so that \sum_{i=1}^m \alpha_i = 1.
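A minimal Octave illustration of the problem just described: with m < n data points, the maximum likelihood covariance estimate is singular. The dimensions chosen are illustrative.

  m = 5; n = 20;                   % far fewer examples than dimensions
  X = randn(m, n);
  mu = mean(X, 1);
  D = X - mu;                      % centered data
  Sigma = (D' * D) / m;            % n-by-n MLE of the covariance
  rank(Sigma)                      % at most m - 1 = 4, so Sigma is singular
  det(Sigma)                       % 0 (up to rounding): 1/|Sigma|^(1/2) blows up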
1 Restrictions of \Sigma

If we do not have sufficient data to fit a full covariance matrix, we may place some restrictions on the space of matrices \Sigma that we will consider. For instance, we may choose to fit a covariance matrix \Sigma that is diagonal. In this setting, the reader may easily verify that the maximum likelihood estimate of the covariance matrix is given by the diagonal matrix \Sigma satisfying

  \Sigma_{jj} = \frac{1}{m} \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2.

Thus, \Sigma_{jj} is just the empirical estimate of the variance of the j-th coordinate of the data.

Recall that the contours of a Gaussian density are ellipses. A diagonal \Sigma corresponds to a Gaussian where the major axes of these ellipses are axis-aligned.

Sometimes, we may place a further restriction on the covariance matrix that not only must it be diagonal, but its diagonal entries must all be equal. In this setting, we have \Sigma = \sigma^2 I, where \sigma^2 is the parameter under our control. The maximum likelihood estimate of \sigma^2 can be found to be:

  \sigma^2 = \frac{1}{mn} \sum_{j=1}^n \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2.

This model corresponds to using Gaussians whose densities have contours that are circles (in 2 dimensions; or spheres/hyperspheres in higher dimensions).

If we were fitting a full, unconstrained, covariance matrix \Sigma to data, it was necessary that m \geq n + 1 in order for the maximum likelihood estimate of \Sigma not to be singular. Under either of the two restrictions above, we may obtain non-singular \Sigma when m \geq 2.

However, restricting \Sigma to be diagonal also means modeling the different coordinates x_i, x_j of the data as being uncorrelated and independent. Often, it would be nice to be able to capture some interesting correlation structure in the data. If we were to use either of the restrictions on \Sigma described above, we would therefore fail to do so. In this set of notes, we will describe the factor analysis model, which uses more parameters than the diagonal \Sigma and captures some correlations in the data, but also without having to fit a full covariance matrix.

2 Marginals and conditionals of Gaussians

Before describing factor analysis, we digress to talk about how to find conditional and marginal distributions of random variables with a joint multivariate Gaussian distribution.

Suppose we have a vector-valued random variable

  x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix},

where x_1 in R^r, x_2 in R^s, and x in R^{r+s}. Suppose x ~ N(\mu, \Sigma), where

  \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.

Here, \mu_1 in R^r, \mu_2 in R^s, \Sigma_{11} in R^{r \times r}, \Sigma_{12} in R^{r \times s}, and so on. Note that since covariance matrices are symmetric, \Sigma_{12} = \Sigma_{21}^T.

Under our assumptions, x_1 and x_2 are jointly multivariate Gaussian. What is the marginal distribution of x_1? It is not hard to see that E[x_1] = \mu_1, and that Cov(x_1) = E[(x_1 - \mu_1)(x_1 - \mu_1)^T] = \Sigma_{11}. To see that the latter is true, note that by definition of the joint covariance of x_1 and x_2, we have that

  Cov(x) = \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}
    = E[(x - \mu)(x - \mu)^T]
    = E\left[ \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix}^T \right]
    = E\begin{bmatrix} (x_1 - \mu_1)(x_1 - \mu_1)^T & (x_1 - \mu_1)(x_2 - \mu_2)^T \\ (x_2 - \mu_2)(x_1 - \mu_1)^T & (x_2 - \mu_2)(x_2 - \mu_2)^T \end{bmatrix}.

Matching the upper-left subblocks in the matrices in the second and the last lines above gives the result.

Since marginal distributions of Gaussians are themselves Gaussian, we therefore have that the marginal distribution of x_1 is given by x_1 ~ N(\mu_1, \Sigma_{11}).

Also, we can ask, what is the conditional distribution of x_1 given x_2? By referring to the definition of the multivariate Gaussian distribution, it can be shown that x_1 | x_2 ~ N(\mu_{1|2}, \Sigma_{1|2}), where

  \mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),      (1)
  \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.      (2)

When working with the factor analysis model in the next section, these formulas for finding conditional and marginal distributions of Gaussians will be very useful.
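A small Octave sketch of formulas (1)-(2) above, computing the conditional mean and covariance of x_1 given x_2 from a partitioned Gaussian; all numerical values are illustrative, with r = 2 and s = 1.

  mu1 = [0; 0]; mu2 = 1;
  S11 = [2 1; 1 2]; S12 = [0.5; 0.3]; S22 = 1.5;
  x2 = 2;                                       % observed value of x2
  mu_cond = mu1 + S12 * (S22 \ (x2 - mu2));     % equation (1)
  S_cond  = S11 - S12 * (S22 \ S12');           % equation (2)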
3 The factor analysis model

In the factor analysis model, we posit a joint distribution on (x, z) as follows, where z in R^k is a latent random variable:

  z ~ N(0, I)
  x | z ~ N(\mu + \Lambda z, \Psi).

Here, the parameters of our model are the vector \mu in R^n, the matrix \Lambda in R^{n \times k}, and the diagonal matrix \Psi in R^{n \times n}. The value of k is usually chosen to be smaller than n.

Thus, we imagine that each datapoint x^{(i)} is generated by sampling a k-dimensional multivariate Gaussian z^{(i)}. Then, it is mapped to a k-dimensional affine space of R^n by computing \mu + \Lambda z^{(i)}. Lastly, x^{(i)} is generated by adding covariance \Psi noise to \mu + \Lambda z^{(i)}.

Equivalently (convince yourself that this is the case), we can therefore also define the factor analysis model according to

  z ~ N(0, I)
  \epsilon ~ N(0, \Psi)
  x = \mu + \Lambda z + \epsilon,

where \epsilon and z are independent.

Lets work out exactly what distribution our model defines. Our random variables z and x have a joint Gaussian distribution

  \begin{bmatrix} z \\ x \end{bmatrix} ~ N(\mu_{zx}, \Sigma).

We will now find \mu_{zx} and \Sigma.

We know that E[z] = 0, from the fact that z ~ N(0, I). Also, we have that

  E[x] = E[\mu + \Lambda z + \epsilon] = \mu + \Lambda E[z] + E[\epsilon] = \mu.

Putting these together, we obtain

  \mu_{zx} = \begin{bmatrix} 0 \\ \mu \end{bmatrix}

Next, to find \Sigma, we need to calculate \Sigma_{zz} = E[(z - E[z])(z - E[z])^T] (the upper-left block of \Sigma), \Sigma_{zx} = E[(z - E[z])(x - E[x])^T] (upper-right block), and \Sigma_{xx} = E[(x - E[x])(x - E[x])^T] (lower-right block).

Now, since z ~ N(0, I), we easily find that \Sigma_{zz} = Cov(z) = I. Also,

  E[(z - E[z])(x - E[x])^T] = E[z (\mu + \Lambda z + \epsilon - \mu)^T] = E[z z^T] \Lambda^T + E[z \epsilon^T] = \Lambda^T.

In the last step, we used the fact that E[z z^T] = Cov(z) (since z has zero mean), and E[z \epsilon^T] = E[z] E[\epsilon^T] = 0 (since z and \epsilon are independent, and hence the expectation of their product is the product of their expectations). Similarly, we can find \Sigma_{xx} as follows:

  E[(x - E[x])(x - E[x])^T] = E[(\mu + \Lambda z + \epsilon - \mu)(\mu + \Lambda z + \epsilon - \mu)^T]
    = E[\Lambda z z^T \Lambda^T + \epsilon z^T \Lambda^T + \Lambda z \epsilon^T + \epsilon \epsilon^T]
    = \Lambda E[z z^T] \Lambda^T + E[\epsilon \epsilon^T]
    = \Lambda \Lambda^T + \Psi.

Putting everything together, we therefore have that

  \begin{bmatrix} z \\ x \end{bmatrix} ~ N\left( \begin{bmatrix} 0 \\ \mu \end{bmatrix}, \begin{bmatrix} I & \Lambda^T \\ \Lambda & \Lambda \Lambda^T + \Psi \end{bmatrix} \right).      (3)

Hence, we also see that the marginal distribution of x is given by x ~ N(\mu, \Lambda \Lambda^T + \Psi). Thus, given a training set {x^{(i)}; i = 1, . . . , m}, we can write down the log likelihood of the parameters:

  \ell(\mu, \Lambda, \Psi) = \log \prod_{i=1}^m \frac{1}{(2\pi)^{n/2} |\Lambda \Lambda^T + \Psi|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu)^T (\Lambda \Lambda^T + \Psi)^{-1} (x^{(i)} - \mu) \right).

To perform maximum likelihood estimation, we would like to maximize this quantity with respect to the parameters. But maximizing this formula explicitly is hard (try it yourself), and we are aware of no algorithm that does so in closed-form. So, we will instead use the EM algorithm. In the next section, we derive EM for factor analysis.
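A hedged Octave sketch of sampling from the factor analysis model just defined (z ~ N(0, I_k), epsilon ~ N(0, Psi), x = mu + Lambda*z + epsilon); the dimensions and parameter values are illustrative.

  n = 5; k = 2; m = 1000;
  mu = zeros(n, 1);
  Lambda = randn(n, k);
  Psi = diag(0.1 * ones(n, 1));           % diagonal noise covariance
  Z = randn(k, m);                         % m draws of z
  E = sqrt(diag(Psi)) .* randn(n, m);      % diagonal Psi: independent noise
  X = mu + Lambda * Z + E;                 % each column is one sample x(i)
  % Empirically, cov(X') should approach Lambda*Lambda' + Psi, the marginal
  % covariance of x derived above.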
4 EM for factor analysis

The derivation for the E-step is easy. We need to compute Q_i(z^{(i)}) = p(z^{(i)} | x^{(i)}; \mu, \Lambda, \Psi). By substituting the distribution given in Equation (3) into the formulas (1-2) used for finding the conditional distribution of a Gaussian, we find that z^{(i)} | x^{(i)}; \mu, \Lambda, \Psi ~ N(\mu_{z^{(i)}|x^{(i)}}, \Sigma_{z^{(i)}|x^{(i)}}), where

  \mu_{z^{(i)}|x^{(i)}} = \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} (x^{(i)} - \mu),
  \Sigma_{z^{(i)}|x^{(i)}} = I - \Lambda^T (\Lambda \Lambda^T + \Psi)^{-1} \Lambda.

So, using these definitions for \mu_{z^{(i)}|x^{(i)}} and \Sigma_{z^{(i)}|x^{(i)}}, we have

  Q_i(z^{(i)}) = \frac{1}{(2\pi)^{k/2} |\Sigma_{z^{(i)}|x^{(i)}}|^{1/2}} \exp\left( -\frac{1}{2} (z^{(i)} - \mu_{z^{(i)}|x^{(i)}})^T \Sigma_{z^{(i)}|x^{(i)}}^{-1} (z^{(i)} - \mu_{z^{(i)}|x^{(i)}}) \right).

Lets now work out the M-step. Here, we need to maximize

  \sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \mu, \Lambda, \Psi)}{Q_i(z^{(i)})} \, dz^{(i)}      (4)

with respect to the parameters \mu, \Lambda, \Psi. We will work out only the optimization with respect to \Lambda, and leave the derivations of the updates for \mu and \Psi as an exercise to the reader.

We can simplify Equation (4) as follows:

  \sum_{i=1}^m \int_{z^{(i)}} Q_i(z^{(i)}) \left[ \log p(x^{(i)} | z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)}) \right] dz^{(i)}      (5)
    = \sum_{i=1}^m E_{z^{(i)} \sim Q_i} \left[ \log p(x^{(i)} | z^{(i)}; \mu, \Lambda, \Psi) + \log p(z^{(i)}) - \log Q_i(z^{(i)}) \right]      (6)

Here, the "z^{(i)} ~ Q_i" subscript indicates that the expectation is with respect to z^{(i)} drawn from Q_i. In the subsequent development, we will omit this subscript when there is no risk of ambiguity. Dropping terms that do not depend on the parameters, we find that we need to maximize:

  \sum_{i=1}^m E\left[ \log p(x^{(i)} | z^{(i)}; \mu, \Lambda, \Psi) \right]
    = \sum_{i=1}^m E\left[ \log \frac{1}{(2\pi)^{n/2} |\Psi|^{1/2}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right) \right]
    = \sum_{i=1}^m E\left[ -\frac{1}{2} \log |\Psi| - \frac{n}{2} \log(2\pi) - \frac{1}{2} (x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right]

Lets maximize this with respect to \Lambda. Only the last term above depends on \Lambda. Taking derivatives, and using the facts that tr a = a (for a in R), tr AB = tr BA, and \nabla_A tr A B A^T C = C A B + C^T A B, we get:

  \nabla_\Lambda \sum_{i=1}^m -E\left[ \frac{1}{2} (x^{(i)} - \mu - \Lambda z^{(i)})^T \Psi^{-1} (x^{(i)} - \mu - \Lambda z^{(i)}) \right]
    = \sum_{i=1}^m \nabla_\Lambda E\left[ -tr \frac{1}{2} z^{(i)T} \Lambda^T \Psi^{-1} \Lambda z^{(i)} + tr \, z^{(i)T} \Lambda^T \Psi^{-1} (x^{(i)} - \mu) \right]
    = \sum_{i=1}^m \nabla_\Lambda E\left[ -tr \frac{1}{2} \Lambda^T \Psi^{-1} \Lambda z^{(i)} z^{(i)T} + tr \, \Lambda^T \Psi^{-1} (x^{(i)} - \mu) z^{(i)T} \right]
    = \sum_{i=1}^m E\left[ -\Psi^{-1} \Lambda z^{(i)} z^{(i)T} + \Psi^{-1} (x^{(i)} - \mu) z^{(i)T} \right]

Setting this to zero and simplifying, we get:

  \sum_{i=1}^m \Lambda \, E_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] = \sum_{i=1}^m (x^{(i)} - \mu) \, E_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right].

Hence, solving for \Lambda, we obtain

  \Lambda = \left( \sum_{i=1}^m (x^{(i)} - \mu) E_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right] \right) \left( \sum_{i=1}^m E_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] \right)^{-1}.      (7)

It is interesting to note the close relationship between this equation and the normal equation that we'd derived for least squares regression,

  "\theta^T = (y^T X)(X^T X)^{-1}."

The analogy is that here, the x's are a linear function of the z's (plus noise). Given the "guesses" for z that the E-step has found, we will now try to estimate the unknown linearity \Lambda relating the x's and z's. It is therefore no surprise that we obtain something similar to the normal equation. There is, however, one important difference between this and an algorithm that performs least squares using just the "best guesses" of the z's; we will see this difference shortly.

To complete our M-step update, lets work out the values of the expectations in Equation (7). From our definition of Q_i being Gaussian with mean \mu_{z^{(i)}|x^{(i)}} and covariance \Sigma_{z^{(i)}|x^{(i)}}, we easily find

  E_{z^{(i)} \sim Q_i}\left[ z^{(i)T} \right] = \mu_{z^{(i)}|x^{(i)}}^T
  E_{z^{(i)} \sim Q_i}\left[ z^{(i)} z^{(i)T} \right] = \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}}.
The latter comes from the fact that, for a random variable Y, Cov(Y) = E[Y Y^T] - E[Y] E[Y]^T, and hence E[Y Y^T] = E[Y] E[Y]^T + Cov(Y). Substituting this back into Equation (7), we get the M-step update for \Lambda:

  \Lambda = \left( \sum_{i=1}^m (x^{(i)} - \mu) \mu_{z^{(i)}|x^{(i)}}^T \right) \left( \sum_{i=1}^m \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \right)^{-1}.      (8)

It is important to note the presence of the \Sigma_{z^{(i)}|x^{(i)}} on the right hand side of this equation. This is the covariance in the posterior distribution p(z^{(i)} | x^{(i)}) of z^{(i)} given x^{(i)}, and the M-step must take into account this uncertainty about z^{(i)} in the posterior. A common mistake in deriving EM is to assume that in the E-step, we need to calculate only the expectation E[z] of the latent random variable z, and then plug that into the optimization in the M-step everywhere z occurs. While this worked for simple problems such as the mixture of Gaussians, in our derivation for factor analysis, we needed E[z z^T] as well as E[z]; and as we saw, E[z z^T] and E[z] E[z]^T differ by the quantity \Sigma_{z|x}. Thus, the M-step update must take into account the covariance of z in the posterior distribution p(z^{(i)} | x^{(i)}).

Lastly, we can also find the M-step optimizations for the parameters \mu and \Psi. It is not hard to show that the first is given by

  \mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}.

Since this doesn't change as the parameters are varied (i.e., unlike the update for \Lambda, the right hand side does not depend on Q_i(z^{(i)}) = p(z^{(i)} | x^{(i)}; \mu, \Lambda, \Psi), which in turn depends on the parameters), this can be calculated just once and need not be further updated as the algorithm is run. Similarly, the diagonal \Psi can be found by calculating

  \Phi = \frac{1}{m} \sum_{i=1}^m x^{(i)} x^{(i)T} - x^{(i)} \mu_{z^{(i)}|x^{(i)}}^T \Lambda^T - \Lambda \mu_{z^{(i)}|x^{(i)}} x^{(i)T} + \Lambda \left( \mu_{z^{(i)}|x^{(i)}} \mu_{z^{(i)}|x^{(i)}}^T + \Sigma_{z^{(i)}|x^{(i)}} \right) \Lambda^T,

and setting \Psi_{ii} = \Phi_{ii} (i.e., letting \Psi be the diagonal matrix containing only the diagonal entries of \Phi).
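A hedged Octave sketch of one EM iteration for factor analysis, following the updates above; the function name and variable names are illustrative, X is m-by-n with rows x^{(i)T}, and the sketch is written with the data centered by the estimated \mu (i.e., with x^{(i)} - \mu in place of x^{(i)} in the \Psi update).

  % fa_em_step.m -- one EM iteration for factor analysis (sketch).
  function [Lambda, Psi] = fa_em_step(X, mu, Lambda, Psi)
    [m, n] = size(X);
    k = size(Lambda, 2);
    D = X - mu';                            % center the data by mu
    % E-step: posterior N(mu_{z|x}, Sigma_{z|x}) for every example.
    G = Lambda * Lambda' + Psi;             % marginal covariance of x
    W = Lambda' / G;                        % Lambda' * inv(G), k-by-n
    Mz = D * W';                            % m-by-k; row i is mu_{z(i)|x(i)}'
    Sz = eye(k) - W * Lambda;               % posterior covariance (shared)
    % M-step for Lambda, Equation (8): note the m*Sz term carrying the
    % posterior uncertainty, the "important difference" discussed above.
    Lambda = (D' * Mz) / (Mz' * Mz + m * Sz);
    % M-step for the diagonal Psi, via the matrix Phi.
    Phi = (D' * D - D' * Mz * Lambda' - Lambda * Mz' * D ...
           + Lambda * (Mz' * Mz + m * Sz) * Lambda') / m;
    Psi = diag(diag(Phi));                  % keep only the diagonal entries
  end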
\ No newline at end of file
diff --git a/Lectures/aimlcs229/cs229-prob.txt b/Lectures/aimlcs229/cs229-prob.txt new file mode 100644 index 0000000..9ab012c --- /dev/null +++ b/Lectures/aimlcs229/cs229-prob.txt @@ -0,0 +1,725 @@

Probability Theory Review for Machine Learning
Samuel Ieong
November 6, 2006

1 Basic Concepts

Broadly speaking, probability theory is the mathematical study of uncertainty. It plays a central role in machine learning, as the design of learning algorithms often relies on probabilistic assumptions about the data. This set of notes attempts to cover some basic probability theory that serves as a background for the class.

1.1 Probability Space

When we speak about probability, we often refer to the probability of an event of uncertain nature taking place. For example, we speak about the probability of rain next Tuesday. Therefore, in order to discuss probability theory formally, we must first clarify what the possible events are to which we would like to attach probability.

Formally, a probability space is defined by the triple (Omega, F, P), where

- Omega is the space of possible outcomes (or outcome space),
- F, a subset of 2^Omega (the power set of Omega), is the space of (measurable) events (or event space),
- P is the probability measure (or probability distribution) that maps an event E in F to a real value between 0 and 1 (think of P as a function).

Given the outcome space Omega, there are some restrictions as to what subset of 2^Omega can be considered an event space F:

- The trivial event Omega and the empty event (the empty set) are in F.
- The event space F is closed under (countable) union, i.e., if alpha, beta are in F, then their union is in F.
- The event space F is closed under complement, i.e., if alpha is in F, then (Omega \ alpha) is in F.

Example 1. Suppose we throw a (six-sided) die. The space of possible outcomes is Omega = {1, 2, 3, 4, 5, 6}. We may decide that the events of interest are whether the die throw is odd or even. This event space will be given by F = {emptyset, {1, 3, 5}, {2, 4, 6}, Omega}.

Note that when the outcome space Omega is finite, as in the previous example, we often take the event space F to be 2^Omega. This treatment is not fully general, but it is often sufficient for practical purposes. However, when the outcome space is infinite, we must be careful to define what the event space is.

Given an event space F, the probability measure P must satisfy certain axioms.

- (non-negativity) For all alpha in F, P(alpha) >= 0.
- (trivial event) P(Omega) = 1.
- (additivity) For all alpha, beta in F with alpha and beta disjoint, P(alpha union beta) = P(alpha) + P(beta).

Example 2. Returning to our die example, suppose we now take the event space F to be 2^Omega. Further, we define a probability distribution P over F such that

  P({1}) = P({2}) = · · · = P({6}) = 1/6

then this distribution P completely specifies the probability of any given event happening (through the additivity axiom). For example, the probability of an even die throw will be

  P({2, 4, 6}) = P({2}) + P({4}) + P({6}) = 1/6 + 1/6 + 1/6 = 1/2

since each of these events is disjoint.

1.2 Random Variables

Random variables play an important role in probability theory. The most important fact about random variables is that they are not variables. They are actually functions that map outcomes (in the outcome space) to real values. In terms of notation, we usually denote random variables by a capital letter. Let's see an example.

Example 3. Again, consider the process of throwing a die. Let X be a random variable that depends on the outcome of the throw. A natural choice for X would be to map the outcome i to the value i, i.e., mapping the event of throwing a "one" to the value of 1. Note that we could have chosen some strange mappings too. For example, we could have a random variable Y that maps all outcomes to 0, which would be a very boring function, or a random variable Z that maps the outcome i to the value of 2i if i is odd and the value of -i if i is even, which would be quite strange indeed.
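The random variables of Example 3, written out as explicit functions on the outcome space in a small illustrative Octave sketch:

  omega = 1:6;                              % the six possible outcomes
  X = omega;                                % X maps outcome i to i
  Y = zeros(1, 6);                          % Y maps every outcome to 0
  Z = (mod(omega,2)==1) .* (2*omega) ...    % 2i if i is odd,
    + (mod(omega,2)==0) .* (-omega);        % -i if i is even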
In a sense, random variables allow us to abstract away from the formal notion of event space, as we can define random variables that capture the appropriate events. For example, consider the event space of odd or even die throws in Example 1. We could have defined a random variable that takes on value 1 if outcome i is odd and 0 otherwise. This type of binary random variable is very common in practice, and is known as an indicator variable, taking its name from its use to indicate whether a certain event has happened. So why did we introduce event space? That is because when one studies probability theory (more rigorously) using measure theory, the distinction between outcome space and event space will be very important. This topic is too advanced to be covered in this short review note. In any case, it is good to keep in mind that event space is not always simply the power set of the outcome space.

From here onwards, we will talk mostly about probability with respect to random variables. While some probability concepts can be defined meaningfully without using them, random variables allow us to provide a more uniform treatment of probability theory. For notation, the probability of a random variable X taking on the value of a will be denoted by either

  P(X = a)  or  P_X(a)

We will also denote the range of a random variable X by Val(X).

1.3 Distributions, Joint Distributions, and Marginal Distributions

We often speak about the distribution of a variable. This formally refers to the probability of a random variable taking on certain values. For example,

Example 4. Let random variable X be defined on the outcome space Omega of a die throw (again!). If the die is fair, then the distribution of X would be

  P_X(1) = P_X(2) = · · · = P_X(6) = 1/6

Note that while this example resembles that of Example 2, they have different semantic meaning. The probability distribution defined in Example 2 is over events, whereas the one here is defined over random variables.

For notation, we will use P(X) to denote the distribution of the random variable X.

Sometimes, we speak about the distribution of more than one variable at a time. We call these distributions joint distributions, as the probability is determined jointly by all the variables involved. This is best clarified by an example.

Example 5. Let X be a random variable defined on the outcome space of a die throw. Let Y be an indicator variable that takes on value 1 if a coin flip turns up heads and 0 if tails. Assuming both the die and the coin are fair, the joint distribution of X and Y is given by

  P       X=1    X=2    X=3    X=4    X=5    X=6
  Y=0     1/12   1/12   1/12   1/12   1/12   1/12
  Y=1     1/12   1/12   1/12   1/12   1/12   1/12

As before, we will denote the probability of X taking value a and Y taking value b by either the long hand of P(X = a, Y = b), or the short hand of P_{X,Y}(a, b). We refer to their joint distribution by P(X, Y).

Given a joint distribution, say over random variables X and Y, we can talk about the marginal distribution of X or that of Y. The marginal distribution refers to the probability distribution of a random variable on its own. To find out the marginal distribution of a random variable, we sum out all the other random variables from the distribution. Formally, we mean

  P(X) = \sum_{b \in Val(Y)} P(X, Y = b)      (1)

The name of marginal distribution comes from the fact that if we add up all the entries of a row (or a column) of a joint distribution, and write the answer at the end (i.e., margin) of it, this will be the probability of the random variable taking on that value. Of course, thinking in this way only helps when the joint distribution involves two variables.

1.4 Conditional Distributions

Conditional distributions are one of the key tools in probability theory for reasoning about uncertainty. They specify the distribution of a random variable when the value of another random variable is known (or more generally, when some event is known to be true). Formally, the conditional probability of X = a given Y = b is defined as

  P(X = a | Y = b) = \frac{P(X = a, Y = b)}{P(Y = b)}      (2)

Note that this is not defined when the probability of Y = b is 0.

Example 6. Suppose we know that a die throw was odd, and want to know the probability that a "one" has been thrown. Let X be the random variable of the die throw, and Y be an indicator variable that takes on the value of 1 if the die throw turns up odd; then we write our desired probability as follows:

  P(X = 1 | Y = 1) = \frac{P(X = 1, Y = 1)}{P(Y = 1)} = \frac{1/6}{1/2} = 1/3
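A small Octave sketch of Example 5, Equation (1), and Equation (2): marginalizing and conditioning on the joint table of the die X and the coin Y (the array layout mirrors the table above).

  P = ones(2, 6) / 12;              % rows: Y = 0, 1; columns: X = 1..6
  PX = sum(P, 1);                   % marginal of X: each entry 1/6
  PY = sum(P, 2);                   % marginal of Y: each entry 1/2
  P_X_given_Y1 = P(2, :) / PY(2);   % conditional distribution P(X | Y = 1)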
The idea of conditional probability extends naturally to the case when the distribution of a random variable is conditioned on several variables, namely

  P(X = a | Y = b, Z = c) = \frac{P(X = a, Y = b, Z = c)}{P(Y = b, Z = c)}

As for notation, we write P(X | Y = b) to denote the distribution of random variable X when Y = b. We may also write P(X | Y) to denote a set of distributions of X, one for each of the different values that Y can take.

1.5 Independence

In probability theory, independence means that the distribution of a random variable does not change on learning the value of another random variable. In machine learning, we often make such assumptions about our data. For example, the training samples are assumed to be drawn independently from some underlying space; the label of sample i is assumed to be independent of the features of sample j (for i not equal to j).

Mathematically, a random variable X is independent of Y when

  P(X) = P(X | Y)

(Note that we have dropped what values X and Y are taking. This means the statement holds true for any values X and Y may take.)

Using Equation (2), it is easy to verify that if X is independent of Y, then Y is also independent of X. As a notation, we write X ⊥ Y if X and Y are independent.

An equivalent mathematical statement about the independence of random variables X and Y is

  P(X, Y) = P(X) P(Y)

Sometimes we also talk about conditional independence, meaning that if we know the value of a random variable (or more generally, a set of random variables), then some other random variables will be independent of each other. Formally, we say "X and Y are conditionally independent given Z" if

  P(X | Z) = P(X | Y, Z)

or, equivalently,

  P(X, Y | Z) = P(X | Z) P(Y | Z)

An example of conditional independence that we will see in class is the Naïve Bayes assumption. This assumption is made in the context of a learning algorithm for learning to classify emails as spams or non-spams. It assumes that the probability of a word x appearing in the email is conditionally independent of a word y appearing, given whether the email is spam or not. This clearly is not without loss of generality, as some words almost invariably come in pairs. However, as it turns out, making this simplifying assumption does not hurt the performance much, and in any case allows us to learn to classify spams rapidly. Details can be found in Lecture Notes 2.

1.6 Chain Rule and Bayes Rule

We now present two basic yet important rules that relate joint distributions and conditional distributions. The first is known as the Chain Rule. It can be seen as a generalization of Equation (2) to multiple random variables.

Theorem 1 (Chain Rule).

  P(X_1, X_2, . . . , X_n) = P(X_1) P(X_2 | X_1) · · · P(X_n | X_1, X_2, . . . , X_{n-1})      (3)

The Chain Rule is often used to evaluate the joint probability of some random variables, and is especially useful when there is (conditional) independence across variables. Notice there is a choice in the order we unravel the random variables when applying the Chain Rule; picking the right order can often make evaluating the probability much easier.

The second rule we are going to introduce is the Bayes Rule.
The Bayes Rule allows us to compute the conditional probability P(X | Y) from P(Y | X), in a sense "inverting" the conditions. It can be derived simply from Equation (2) as well.

Theorem 2 (Bayes Rule).

  P(X | Y) = \frac{P(Y | X) P(X)}{P(Y)}      (4)

And recall that if P(Y) is not given, we can always apply Equation (1) to find it:

  P(Y) = \sum_{a \in Val(X)} P(X = a, Y) = \sum_{a \in Val(X)} P(Y | X = a) P(X = a)

This application of Equation (1) is sometimes referred to as the law of total probability.

Extending the Bayes Rule to the case of multiple random variables can sometimes be tricky. Just to be clear, we will give a few examples. When in doubt, one can always refer to how conditional probabilities are defined and work out the details.

Example 7. Let's consider the following conditional probabilities: P(X, Y | Z) and P(X | Y, Z).

  P(X, Y | Z) = \frac{P(Z | X, Y) P(X, Y)}{P(Z)} = \frac{P(Y, Z | X) P(X)}{P(Z)}

  P(X | Y, Z) = \frac{P(Y | X, Z) P(X, Z)}{P(Y, Z)} = \frac{P(Y | X, Z) P(X | Z) P(Z)}{P(Y | Z) P(Z)} = \frac{P(Y | X, Z) P(X | Z)}{P(Y | Z)}

2 Defining a Probability Distribution

We have been talking about probability distributions for a while. But how do we define a distribution? In a broad sense, there are two classes of distribution that require seemingly different treatments (these can be unified using measure theory). Namely, discrete distributions and continuous distributions. We will discuss how distributions are specified next.

Note that this discussion is distinct from how we can efficiently represent a distribution. The topic of efficient representation of probability distributions is in fact a very important and active research area that deserves its own course. If you are interested to learn more about how to efficiently represent, reason, and perform learning on distributions, you are advised to take CS228: Probabilistic Models in Artificial Intelligence.

2.1 Discrete Distribution: Probability Mass Function

By a discrete distribution, we mean that the random variable of the underlying distribution can take on only finitely many different values (or that the outcome space is finite).

To define a discrete distribution, we can simply enumerate the probability of the random variable taking on each of the possible values. This enumeration is known as the probability mass function, as it divides up a unit mass (the total probability) and places it on the different values a random variable can take. This can be extended analogously to joint distributions and conditional distributions.

2.2 Continuous Distribution: Probability Density Function

By a continuous distribution, we mean that the random variable of the underlying distribution can take on infinitely many different values (or that the outcome space is infinite).

This is arguably a trickier situation than the discrete case, since if we place a non-zero amount of mass on each of the values, the total mass will add up to infinity, which violates the requirement that the total probability must sum up to one.

To define a continuous distribution, we will make use of the probability density function (PDF). A probability density function, f, is a non-negative, integrable function such that

  \int_{Val(X)} f(x) \, dx = 1

The probability of a random variable X distributed according to a PDF f is computed as follows:

  P(a \leq X \leq b) = \int_a^b f(x) \, dx

Note that this, in particular, implies that the probability of a continuously distributed random variable taking on any given single value is zero.
Example 8 (Uniform distribution). Let's consider a random variable X that is uniformly distributed in the range [0, 1]. The corresponding PDF would be

  f(x) = 1 if 0 <= x <= 1, and 0 otherwise.

We can verify that \int_0^1 1 \, dx is indeed 1, and therefore f is a PDF. To compute the probability of X smaller than a half,

  P(X \leq 1/2) = \int_0^{1/2} 1 \, dx = [x]_0^{1/2} = 1/2

More generally, suppose X is distributed uniformly over the range [a, b]; then the PDF would be

  f(x) = \frac{1}{b - a} if a <= x <= b, and 0 otherwise.

Sometimes we will also speak about the cumulative distribution function. It is a function that gives the probability of a random variable being smaller than some value. A cumulative distribution function F is related to the underlying probability density function f as follows:

  F(b) = P(X \leq b) = \int_{-\infty}^b f(x) \, dx

and hence F(x) = \int f(x) \, dx (in the sense of indefinite integral).

To extend the definition of continuous distributions to joint distributions, the probability density function is extended to take multiple arguments, namely,

  P(a_1 \leq X_1 \leq b_1, a_2 \leq X_2 \leq b_2, . . . , a_n \leq X_n \leq b_n) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, x_2, . . . , x_n) \, dx_1 dx_2 . . . dx_n

To extend the definition of conditional distributions to continuous random variables, we run into the problem that the probability of a continuous random variable taking on a single value is 0, so Equation (2) is not well defined, since the denominator equals 0. To define the conditional distribution of a continuous variable, let f(x, y) be the joint distribution of X and Y. Through application of analysis, we can show that the PDF, f(y|x), underlying the distribution P(Y | X) is given by

  f(y | x) = \frac{f(x, y)}{f(x)}

For example,

  P(a \leq Y \leq b | X = c) = \int_a^b f(y | c) \, dy = \int_a^b \frac{f(c, y)}{f(c)} \, dy

3 Expectations and Variance

3.1 Expectations

One of the most common operations we perform on a random variable is to compute its expectation, also known as its mean, expected value, or first moment. The expectation of a random variable, denoted by E(X), is given by

  E(X) = \sum_{a \in Val(X)} a \, P(X = a)  or  E(X) = \int_{Val(X)} x f(x) \, dx      (5)

Example 9. Let X be the outcome of rolling a fair die. The expectation of X is

  E(X) = \frac{1}{6}(1) + \frac{1}{6}(2) + \cdots + \frac{1}{6}(6) = 3.5

We may sometimes be interested in computing the expected value of some function f of a random variable X. Recall, however, that a random variable is also a function itself, so the easiest way to think about this is that we define a new random variable Y = f(X), and compute the expected value of Y instead.

When working with indicator variables, a useful identity is the following:

  E(X) = P(X = 1) for indicator variable X

When working with sums of random variables, one of the most important rules is the linearity of expectations.

Theorem 3 (Linearity of Expectations). Let X_1, X_2, . . . , X_n be (possibly dependent) random variables,

  E(X_1 + X_2 + · · · + X_n) = E(X_1) + E(X_2) + · · · + E(X_n)      (6)

The linearity of expectations is very powerful because there are no restrictions on whether the random variables are independent or not. When we work on products of random variables, however, there is very little we can say in general.
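A quick Monte Carlo check of Theorem 3 in Octave on two strongly dependent random variables (all values illustrative): linearity holds even though X and Y = X are clearly not independent.

  X = randi(6, 1e5, 1);             % 100,000 fair die throws
  Y = X;                            % Y is completely dependent on X
  mean(X + Y)                       % approximately 7
  mean(X) + mean(Y)                 % identical, since the sample mean is linear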
However, when the random variables are independent, we have:

Theorem 4. Let X and Y be independent random variables,

  E(XY) = E(X) E(Y)

3.2 Variance

The variance of a distribution is a measure of the "spread" of a distribution. Sometimes it is also referred to as the second moment. It is defined as follows:

  Var(X) = E\left( (X - E(X))^2 \right)      (7)

The variance of a random variable is often denoted by \sigma^2. The reason that this is squared is because we often want to find out \sigma, known as the standard deviation. The variance and the standard deviation are related (obviously) by \sigma = \sqrt{Var(X)}.

To find out the variance of a random variable X, it's often easier to compute the following instead:

  Var(X) = E(X^2) - (E(X))^2

Note that unlike expectation, variance is not a linear function of a random variable X. In fact, we can verify that the variance of (aX + b) is

  Var(aX + b) = a^2 Var(X)

If random variables X and Y are independent, then

  Var(X + Y) = Var(X) + Var(Y)  if X ⊥ Y

Sometimes we also talk about the covariance of two random variables. This is a measure of how "closely related" two random variables are. Its definition is as follows.

  Cov(X, Y) = E\left( (X - E(X))(Y - E(Y)) \right)

4 Some Important Distributions

In this section, we will review some of the probability distributions that we will see in this class. This is by no means a comprehensive list of distributions that one should know. In particular, distributions such as the geometric, hypergeometric, and binomial distributions, which are very useful in their own right and studied in introductory probability theory, are not reviewed here.

4.1 Bernoulli

The Bernoulli distribution is one of the most basic distributions. A random variable distributed according to the Bernoulli distribution can take on two possible values, {0, 1}. It can be specified by a single parameter p, and by convention we take p to be P(X = 1). It is often used to indicate whether a trial is successful or not.

Sometimes it is useful to write the probability distribution of a Bernoulli random variable X as follows:

  P(X = x) = p^x (1 - p)^{1-x}

An example of the Bernoulli distribution in action is the classification task in Lecture Notes 1. To develop the logistic regression algorithm for the task, we assume that the labels are distributed according to the Bernoulli distribution given the features.

4.2 Poisson

The Poisson distribution is a very useful distribution that deals with the arrival of events. It measures the probability of the number of events happening over a fixed period of time, given a fixed average rate of occurrence, and that the events take place independently of the time since the last event. It is parametrized by the average arrival rate \lambda. The probability mass function is given by:

  P(X = k) = \frac{\exp(-\lambda) \lambda^k}{k!}

The mean value of a Poisson random variable is \lambda, and its variance is also \lambda.

We will get to work on a learning algorithm that deals with Poisson random variables in Homework 1, Problem 3.

4.3 Gaussian

The Gaussian distribution, also known as the normal distribution, is one of the most "versatile" distributions in probability theory, and appears in a wide variety of contexts. For example, it can be used to approximate the binomial distribution when the number of experiments is large, or the Poisson distribution when the average arrival rate is high. It is also related to the Law of Large Numbers. For many problems, we will also often assume that the noise in the system is Gaussian distributed. The list of applications is endless.
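An illustrative Octave snippet for producing plots like Figure 1 below: the Gaussian density (1/sqrt(2*pi*s2)) * exp(-(x - mu)^2 / (2*s2)) under a few different means and variances (the three parameter choices match the curves in the figure).

  x = linspace(-5, 5, 200);
  g = @(x, mu, s2) exp(-(x - mu).^2 / (2*s2)) / sqrt(2*pi*s2);
  plot(x, g(x, 0, 1), x, g(x, 1, 1), x, g(x, 0, 2));
  legend('Gaussian(0,1)', 'Gaussian(1,1)', 'Gaussian(0,2)');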
[Figure: three bell curves over the range [-5, 5], labeled Gaussian(0,1), Gaussian(1,1), and Gaussian(0,2).]

Figure 1: Gaussian distributions under different mean and variance

The Gaussian distribution is determined by two parameters: the mean \mu and the variance \sigma^2. The probability density function is given by

  f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)      (8)

To get a better sense of how the distribution changes with respect to the mean and the variance, we have plotted three different Gaussian distributions in Figure 1.

In our class, we will sometimes work with multi-variate Gaussian distributions. A k-dimensional multi-variate Gaussian distribution is parametrized by (\mu, \Sigma), where \mu is now a vector of means in R^k, and \Sigma is the covariance matrix in R^{k \times k}, in other words, \Sigma_{ii} = Var(X_i) and \Sigma_{ij} = Cov(X_i, X_j). The probability density function is now defined over vectors of input, given by

  f(x) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)      (9)

(Recall that we denote the determinant of a matrix A by |A|, and its inverse by A^{-1}.)

To get a better sense of how a multi-variate Gaussian distribution depends on the covariance matrix, we can look at the figures in Lecture Notes 2, Pages 3-4.

Working with a multi-variate Gaussian distribution can be tricky and daunting at times. One way to make our lives easier, at least as a way to get intuition on a problem, is to assume that the covariances are zero when we first attempt a problem. When the covariances are zero, the determinant |\Sigma| will simply be the product of the variances, and the inverse \Sigma^{-1} can be found by taking the inverse of the diagonal entries of \Sigma.

5 Working with Probabilities

As we will be working with probabilities and distributions a lot in this class, listed below are a few tips about efficient manipulation of distributions.

5.1 The log trick

In machine learning, we generally assume the independence of different samples. Therefore, we often have to deal with the product of a (large) number of distributions. When our goal is to optimize functions of such products, it is often easier if we first work with the logarithm of such functions. As the logarithmic function is a strictly increasing function, it will not distort where the maximum is located (although, most certainly, the maximum value of the function before and after taking the logarithm will be different).

As an example, consider the likelihood function in Lecture Notes 1, Page 17:

  L(\theta) = \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}

I dare say this is a pretty mean-looking function. But by taking the logarithm of it, termed the log-likelihood function, we have instead

  \ell(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)}))

Not the world's prettiest function, but at least it's more manageable. We can now work on one term (i.e., one training sample) at a time, because they are summed together rather than multiplied together.
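A small Octave illustration of the log trick on the likelihood just shown; the labels y and hypothesis outputs h are illustrative values, not from any particular dataset.

  y = [1 0 1 1];                           % labels y(i)
  h = [0.9 0.2 0.7 0.6];                   % hypothesis outputs h_theta(x(i))
  L  = prod(h.^y .* (1-h).^(1-y));         % likelihood: a product of m terms
  ll = sum(y.*log(h) + (1-y).*log(1-h));   % log-likelihood: a sum of m terms
  abs(log(L) - ll) < 1e-12                 % ans = 1: same function, easier form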
5.2 Delayed Normalization

Because probability has to sum up to one, we often have to deal with normalization, especially with continuous distributions. For example, for Gaussian distributions, the term outside of the exponent is there to ensure that the integral of the PDF evaluates to one. When we are sure that the end product of some algebra will be a probability distribution, or when we are finding the optimum of some distributions, it's often easier to simply denote the normalization constant as Z, and not worry about computing the normalization constant all the time.

5.3 Jensen's Inequality

Sometimes when we are evaluating the expectation of a function of a random variable, we may only need a bound rather than its exact value. In these situations, if the function is convex or concave, Jensen's inequality allows us to derive a bound by evaluating the value of the function at the expectation of the random variable itself.

[Figure: a convex function plotted over [0, 5], with a chord between two points on the curve lying above the function.]

Figure 2: Illustration of Jensen's Inequality

Theorem 5 (Jensen's Inequality). Let X be a random variable, and f be a convex function. Then

  f(E(X)) \leq E(f(X))

If f is a concave function, then

  f(E(X)) \geq E(f(X))

While we can show Jensen's inequality by algebra, it's easiest to understand it through a picture. The function in Figure 2 is a convex function. We can see that a straight line between any two points on the function always lies above the function. This shows that if a random variable can take on only two values, then Jensen's inequality holds. It is relatively straightforward to extend this to general random variables.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/info.txt b/Lectures/aimlcs229/info.txt new file mode 100644 index 0000000..d8ed8f1 --- /dev/null +++ b/Lectures/aimlcs229/info.txt @@ -0,0 +1,111 @@

CS 229
Machine Learning
Handout #1: Course Information

Teaching Staff and Contact Info
Professor: Andrew Ng
Office: Gates 156
TA: Paul Baumstarck
Office: B24B
TA: Catie Chang
Office: B24A
TA: Chuong (Tom) Do
Office: B24A
TA: Zico Kolter (head TA)
Office: Gates 124
TA: Daniel Ramage
Office: Gates 114

Course Description
This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include: supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs; VC theory; large margins); reinforcement learning and adaptive control. The course will also discuss recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.

Prerequisites
Students are expected to have the following background:
Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program.
Familiarity with basic probability theory. (Stat 116 is sufficient but not necessary.)
Familiarity with basic linear algebra. (Any one of Math 51, Math 103, Math 113, or CS 205 would be much more than necessary.)

Course Materials
There is no required text for this course. Notes will be posted periodically on the course web site. The following books are recommended as optional reading:
Christopher Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
Richard Duda, Peter Hart and David Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
Homeworks and Grading
There will be four written homeworks, one midterm, and one major open-ended term project. The homeworks will contain written questions and questions that require some Matlab programming. In the term project, you will investigate some interesting aspect of machine learning or apply machine learning to a problem that interests you. We try very hard to make questions unambiguous, but some ambiguities may remain. Ask if confused or state your assumptions explicitly. Reasonable assumptions will be accepted in case of ambiguous questions.

A note on the honor code: We strongly encourage students to form study groups. Students may discuss and work on homework problems in groups. However, each student must write down the solutions independently, and without referring to written notes from the joint session. In other words, each student must understand the solution well enough in order to reconstruct it by him/herself. In addition, each student should write on the problem set the set of people with whom s/he collaborated. Further, because we occasionally reuse problem set questions from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions.

Late homeworks: Recognizing that students may face unusual circumstances and require some flexibility in the course of the quarter, each student will have a total of seven free late (calendar) days to use as s/he sees fit. Once these late days are exhausted, any homework turned in late will be penalized 20% per late day. However, no homework will be accepted more than four days after its due date, and late days cannot be used for the final project writeup. Each 24 hours or part thereof that a homework is late uses up one full late day. To hand in a late homework, write down the date and time of submission, and leave it in the submission box at the bottom of the Gates A-wing stairwell. To get into the basement after the building is locked, slide your SUID card in the card reader by the main basement entrance. It is an honor code violation to write down the wrong time. Regular (non-SCPD) students should submit hardcopies of all four written homeworks. Please do not email your homework solutions to us. Off-campus (SCPD) students should fax homework solutions to us at the fax number given above, and write "ATTN: CS229 (Machine Learning)" on the cover page. The term project may be done in teams of up to three persons. The midterm is open-book/open-notes, and will cover the material of the first part of the course. It will take place on 11/8 at 6 pm, exact location to be determined.

Course grades will be based 40% on homeworks (10% each), 20% on the midterm, and 40% on the major term project. Up to 3% extra credit may be awarded for class participation.

Sections
To review material from the prerequisites or to supplement the lecture material, there will occasionally be extra discussion sections held on Friday. An announcement will be made whenever one of these sections is held. Attendance at these sections is optional.

Communication with the Teaching Staff
We strongly encourage students to come to office hours. If that is not possible, questions should be sent to the course staff list (consisting of the TAs and the professor). By having questions sent to all of us, you will get answers much more quickly.
Of course, more personal questions can still be sent directly to Professor Ng or the TAs.

For grading questions, please talk to us after class or during office hours. If you want a regrade, write an explanation and drop the homework and the explanation into the submission box at the bottom of the Gates A-wing stairwell. Answers to commonly asked questions and clarifications to the homeworks will be posted on the FAQ. It is each student's responsibility to check the FAQ on a regular basis. Major changes (e.g., bugs in the homework) will also be posted to the class mailing list.

\ No newline at end of file
diff --git a/Lectures/aimlcs229/practice-midterm.txt b/Lectures/aimlcs229/practice-midterm.txt new file mode 100644 index 0000000..ec558a5 --- /dev/null +++ b/Lectures/aimlcs229/practice-midterm.txt @@ -0,0 +1,228 @@

CS 229, Autumn 2007
Practice Midterm

Notes:
1. The midterm will have about 5-6 long questions, and about 8-10 short questions. Space will be provided on the actual midterm for you to write your answers.
2. The midterm is meant to be educational, and as such some questions could be quite challenging. Use your time wisely to answer as much as you can!

1. [13 points] Generalized Linear Models

Recall that generalized linear models assume that the response variable y (conditioned on x) is distributed according to a member of the exponential family:

  P(y; \eta) = b(y) \exp(\eta T(y) - a(\eta)),

where \eta = \theta^T x. For this problem, we will assume \eta \in R.

(a) [10 points] Given a training set {(x^{(i)}, y^{(i)})}_{i=1}^m, the loglikelihood is given by

  \ell(\theta) = \sum_{i=1}^m \log p(y^{(i)} | x^{(i)}; \theta).

Give a set of conditions on b(y), T(y), and a(\eta) which ensure that the loglikelihood is a concave function of \theta (and thus has a unique maximum). Your conditions must be reasonable, and should be as weak as possible. (E.g., the answer "any b(y), T(y), and a(\eta) so that \ell(\theta) is concave" is not reasonable. Similarly, overly narrow conditions, including ones that apply only to specific GLIMs, are also not reasonable.)

(b) [3 points] When the response variable is distributed according to a Normal distribution (with unit variance), we have b(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}, T(y) = y, and a(\eta) = \frac{\eta^2}{2}. Verify that the condition(s) you gave in part (a) hold for this setting.

2. [15 points] Bayesian linear regression

Consider Bayesian linear regression using a Gaussian prior on the parameters \theta \in R^{n+1}. Thus, in our prior, \theta ~ N(0, \tau^2 I_{n+1}), where \tau^2 \in R, and I_{n+1} is the (n+1)-by-(n+1) identity matrix. Also let the conditional distribution of y^{(i)} given x^{(i)} and \theta be N(\theta^T x^{(i)}, \sigma^2), as in our usual linear least-squares model.(1) Let a set of m IID training examples be given (with x^{(i)} \in R^{n+1}). Recall that the MAP estimate of the parameters \theta is given by:

  \theta_{MAP} = \arg\max_\theta \left( \prod_{i=1}^m p(y^{(i)} | x^{(i)}, \theta) \right) p(\theta)

Find, in closed form, the MAP estimate of the parameters \theta. For this problem, you should treat \tau^2 and \sigma^2 as fixed, known, constants. [Hint: Your solution should involve deriving something that looks a bit like the Normal equations.]

(1) Equivalently, y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}, where the \epsilon^{(i)}'s are distributed IID N(0, \sigma^2).

3. [18 points] Kernels

In this problem, you will prove that certain functions K give valid kernels. Be careful to justify every step in your proofs.
3. [18 points] Kernels
In this problem, you will prove that certain functions K give valid kernels. Be
careful to justify every step in your proofs. Specifically, if you use a result
proved either in the lecture notes or homeworks, be careful to state exactly which
result you're using.
(a) [8 points] Let K(x, z) be a valid (Mercer) kernel over R^n × R^n. Consider the
function given by

    K_e(x, z) = exp(K(x, z)).

Show that K_e is a valid kernel. [Hint: There are many ways of proving this result,
but you might find the following two facts useful: (i) The Taylor expansion of e^x
is given by e^x = Σ_{j=0}^∞ (1/j!) x^j. (ii) If a sequence of non-negative numbers
a_i ≥ 0 has a limit a = lim_{i→∞} a_i, then a ≥ 0.]
(b) [8 points] The Gaussian kernel is given by the function

    K(x, z) = e^(−||x − z||^2 / σ^2),

where σ^2 > 0 is some fixed, positive constant. We said in class that this is a
valid kernel, but did not prove it. Prove that the Gaussian kernel is indeed a valid
kernel. [Hint: The following fact may be useful. ||x − z||^2 = ||x||^2 − 2x^T z + ||z||^2.]

4. [18 points] One-class SVM
Given an unlabeled set of examples {x(1), . . . , x(m)}, the one-class SVM algorithm
tries to find a direction w that maximally separates the data from the origin.2
More precisely, it solves the (primal) optimization problem:

    min_w   (1/2) w^T w
    s.t.    w^T x(i) ≥ 1,   for all i = 1, . . . , m

A new test example x is labeled 1 if w^T x ≥ 1, and 0 otherwise.
(a) [9 points] The primal optimization problem for the one-class SVM was given
above. Write down the corresponding dual optimization problem. Simplify your answer
as much as possible. In particular, w should not appear in your answer.
(b) [4 points] Can the one-class SVM be kernelized (both in training and testing)?
Justify your answer.
(c) [5 points] Give an SMO-like algorithm to optimize the dual. I.e., give an
algorithm that in every optimization step optimizes over the smallest possible
subset of variables. Also give in closed-form the update equation for this subset of
variables. You should also justify why it is sufficient to consider this many
variables at a time in each step.
2 This turns out to be useful for anomaly detection, but I assume you already have
enough to keep you entertained for the 2h 15min of the midterm, and thus wouldn't
want to read about it here. See the midterm solutions for details.

5. [18 points] Uniform Convergence
In this problem, we consider trying to estimate the mean of a biased coin toss. We
will repeatedly toss the coin and keep a running estimate of the mean. We would like
to prove that (with high probability), after some initial set of N tosses, the
running estimate from that point on will always be accurate and never deviate too
much from the true value.
More formally, let X_i ∼ Bernoulli(φ) be IID random variables. Let φ̂_n be our
estimate for φ after n observations:

    φ̂_n = (1/n) Σ_{i=1}^n X_i.

We'd like to show that after a certain number of coin flips, our estimates will stay
close to the true value of φ. More formally, we'd like to show that for all
γ, δ ∈ (0, 1/2], there exists a value N such that

    P( max_{n ≥ N} |φ − φ̂_n| > γ ) ≤ δ.

Show that in order to make the guarantee above, it suffices to have
N = O( (1/γ^2) log(1/(δγ)) ). You may need to use the fact that for γ ∈ (0, 1/2],
log( 1/(1 − exp(−2γ^2)) ) = O( log(1/γ) ).
[Hint: Let A_n be the event that |φ − φ̂_n| > γ and consider taking a union bound
over the set of events A_n, A_{n+1}, A_{n+2}, . . . .]
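As an empirical companion to this last problem (a simulation, not a proof; all the
constants below are arbitrary illustrative choices):

  phi = 0.3; gam = 0.05; N = 2000; T = 20000;  % illustrative values only
  X = (rand(T,1) < phi);                       % T flips of a Bernoulli(phi) coin
  phi_hat = cumsum(X) ./ (1:T)';               % running estimates phi_hat_n
  worst = max(abs(phi_hat(N:end) - phi));      % max over n >= N of |phi - phi_hat_n|
  printf('worst deviation for n >= %d: %.4f (gamma = %.2f)\n', N, worst, gam);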
6. [40 points] Short Answers
The following questions require a true/false accompanied by one sentence of
explanation, or a reasonably short answer (usually at most 1-2 sentences or a
figure).
To discourage random guessing, one point will be deducted for a wrong answer
on multiple choice questions! Also, no credit will be given for answers without
a correct explanation.
(a) [5 points] Let there be a binary classification problem with continuous-valued
features. In Problem Set #1, you showed that if we apply Gaussian discriminant
analysis using the same covariance matrix Σ for both classes, then the resulting
decision boundary will be linear. What will the decision boundary look like if we
instead model the two classes using separate covariance matrices Σ0 and Σ1?
(I.e., x(i) | y(i) = b ∼ N(µ_b, Σ_b), for b = 0 or 1.)
(b) [5 points] Consider a sequence of examples (x(1), y(1)), (x(2), y(2)), · · ·,
(x(m), y(m)). Assume that for all i we have ||x(i)|| ≤ D and that the data are
linearly separated with a margin γ. Suppose that the perceptron algorithm makes
exactly (D/γ)^2 mistakes on this sequence of examples. Now, suppose we use a feature
mapping φ(·) to a higher-dimensional space and use the corresponding kernel
perceptron algorithm on the same sequence of data (now in the higher-dimensional
feature space). Then the kernel perceptron (implicitly operating in this
higher-dimensional feature space) will make a number of mistakes that is
    i. strictly less than (D/γ)^2.
    ii. equal to (D/γ)^2.
    iii. strictly more than (D/γ)^2.
    iv. impossible to say from the given information.
(c) [5 points] Let any x(1), x(2), x(3) ∈ R^p be given (x(1) ≠ x(2), x(1) ≠ x(3),
x(2) ≠ x(3)). Also let any z(1), z(2), z(3) ∈ R^q be fixed. Then there exists a
valid Mercer kernel K : R^p × R^p → R such that for all i, j ∈ {1, 2, 3} we have
K(x(i), x(j)) = (z(i))^T z(j). True or False?
(d) [5 points] Let f : R^n → R be defined according to f(x) = x^T A x + b^T x + c,
where A is symmetric positive definite. Suppose we use Newton's method to minimize
f. Show that Newton's method will find the optimum in exactly one iteration. You may
assume that Newton's method is initialized with 0.
(e) [5 points] Consider binary classification, and let the input domain be
X = {0, 1}^n, i.e., the space of all n-dimensional bit vectors. Thus, each sample x
has n binary-valued features. Let H_n be the class of all boolean functions over the
input space. What is |H_n| and VC(H_n)?
(f) [5 points] Suppose an ℓ1-regularized SVM (with regularization parameter C > 0)
is trained on a dataset that is linearly separable. Because the data is linearly
separable, to minimize the primal objective, the SVM algorithm will set all the
slack variables to zero. Thus, the weight vector w obtained will be the same no
matter what regularization parameter C is used (so long as it is strictly bigger
than zero). True or false?
(g) [5 points] Consider using hold-out cross validation (using 70% of the data for
training, 30% for hold-out CV) to select the bandwidth parameter τ for locally
weighted linear regression. As the number of training examples m increases, would
you expect the value of τ selected by the algorithm to generally become larger,
smaller, or neither of the above? For this problem, assume that (the expected value
of) y is a non-linear function of x.
(h) [5 points] Consider a feature selection problem in which the mutual information
MI(xi, y) = 0 for all features xi.
Also for every subset of features Si = {xi1 , · · · , xik } +of size < n/2 we have M I(Si , y) = 0.3 However there is a subset S ∗ of size exactly n/2 +such that M I(S ∗ , y) = 1. I.e. this subset of features allows us to predict y correctly. +Of the three feature selection algorithms listed below, which one do you expect to +work best on this dataset? +i. +ii. +iii. +iv. + +3 M I(S + +Forward Search. +Backward Search. +Filtering using mutual information M I(xi , y). +All three are expected to perform reasonably well. + +P P += Si y P (Si , y) log(P (Si , y)/P (Si )P (y)), where the first summation is over all possible values +of the features in Si . +i , y) + + \ No newline at end of file diff --git a/Lectures/aimlcs229/problemset1.txt b/Lectures/aimlcs229/problemset1.txt new file mode 100644 index 0000000..8e9e64d --- /dev/null +++ b/Lectures/aimlcs229/problemset1.txt @@ -0,0 +1,285 @@ +1 + +CS229 Problem Set #1 + +CS 229, Public Course +Problem Set #1: Supervised Learning +1. Newton’s method for computing least squares +In this problem, we will prove that if we use Newton’s method solve the least squares +optimization problem, then we only need one iteration to converge to θ∗ . +(a) Find the Hessian of the cost function J(θ) = + +1 +2 + +m +T (i) +i=1 (θ x + +− y (i) )2 . + +(b) Show that the first iteration of Newton’s method gives us θ⋆ = (X T X)−1 X T y, the +solution to our least squares problem. +2. Locally-weighted logistic regression +In this problem you will implement a locally-weighted version of logistic regression, where +we weight different training examples differently according to the query point. The locallyweighted logistic regression problem is to maximize +m + +λ +w(i) y (i) log hθ (x(i) ) + (1 − y (i) ) log(1 − hθ (x(i) )) . +ℓ(θ) = − θT θ + +2 +i=1 +The − λ2 θT θ here is what is known as a regularization parameter, which will be discussed +in a future lecture, but which we include here because it is needed for Newton’s method to +perform well on this task. For the entirety of this problem you can use the value λ = 0.0001. +Using this definition, the gradient of ℓ(θ) is given by +∇θ ℓ(θ) = X T z − λθ +where z ∈ Rm is defined by + +zi = w(i) (y (i) − hθ (x(i) )) + +and the Hessian is given by +H = X T DX − λI +where D ∈ Rm×m is a diagonal matrix with +Dii = −w(i) hθ (x(i) )(1 − hθ (x(i) )) +For the sake of this problem you can just use the above formulas, but you should try to +derive these results for yourself as well. +Given a query point x, we choose compute the weights +w(i) = exp − + +||x − x(i) ||2 +2τ 2 + +. + +Much like the locally weighted linear regression that was discussed in class, this weighting +scheme gives more when the “nearby” points when predicting the class of a new example. + + 2 + +CS229 Problem Set #1 + +(a) Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point +x, and use this to predict the class of x. +The q2/ directory contains data and code for this problem. You should implement +the y = lwlr(X train, y train, x, tau) function in the lwlr.m file. This function takes as input the training set (the X train and y train matrices, in the form +described in the class notes), a new query point x and the weight bandwitdh tau. +Given this input the function should 1) compute weights w(i) for each training example, using the formula above, 2) maximize ℓ(θ) using Newton’s method, and finally 3) +output y = 1{hθ (x) > 0.5} as the prediction. +We provide two additional functions that might help. 
The [X train, y train] = +load data; function will load the matrices from files in the data/ folder. The function plot lwlr(X train, y train, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the +locally weighted logistic regression classifier over a large grid of points and plots the +resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending +on how fast your lwlr function is, creating the plot might take some time, so we +recommend debugging your code with resolution = 50; and later increase it to at +least 200 to get a better idea of the decision boundary. +(b) Evaluate the system with a variety of different bandwidth parameters τ . In particular, +try τ = 0.01, 0.050.1, 0.51.0, 5.0. How does the classification boundary change when +varying this parameter? Can you predict what the decision boundary of ordinary +(unweighted) logistic regression would look like? +3. Multivariate least squares +So far in class, we have only considered cases where our target variable y is a scalar value. +Suppose that instead of trying to predict a single output, we have a training set with +multiple outputs for each example: +{(x(i) , y (i) ), i = 1, . . . , m}, x(i) ∈ Rn , y (i) ∈ Rp . +Thus for each training example, y (i) is vector-valued, with p entries. We wish to use a linear +model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in +y = ΘT x, +where Θ ∈ Rn×p . +(a) The cost function for this case is +J(Θ) = + +1 +2 + +m + +p +(i) + +(ΘT x(i) )j − yj + +2 + +. + +i=1 j=1 + +Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: +Start with the m × n design matrix + + +— (x(1) )T — + — (x(2) )T —  + + +X= + +.. + + +. +— + +(x(m) )T + +— + + 3 + +CS229 Problem Set #1 + +and the m × p target matrix + + +— + — + +Y = + + +— + +(y (1) )T +(y (2) )T +.. +. +(y (m) )T + + +— +—  + + + +— + +and then work out how to express J(Θ) in terms of these matrices.] +(b) Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent to +the normal equations for the multivariate case. +(c) Suppose instead of considering the multivariate vectors y (i) all at once, we instead +(i) +compute each variable yj separately for each j = 1, . . . , p. In this case, we have a p +individual linear models, of the form +(i) + +yj = θjT x(i) , j = 1, . . . , p. +(So here, each θj ∈ Rn ). How do the parameters from these p independent least +squares problems compare to the multivariate solution? +4. Naive Bayes +In this problem, we look at maximum likelihood parameter estimation using the naive +Bayes assumption. Here, the input features xj , j = 1, . . . , n to our model are discrete, +binary-valued variables, so xj ∈ {0, 1}. We call x = [x1 x2 · · · xn ]T to be the input vector. +For each training example, our output targets are a single binary-value y ∈ {0, 1}. Our +model is then parameterized by φj|y=0 = p(xj = 1|y = 0), φj|y=1 = p(xj = 1|y = 1), and +φy = p(y = 1). We model the joint distribution of (x, y) according to +p(y) = + +(φy )y (1 − φy )1−y +n + +p(x|y = 0) + +p(xj |y = 0) + += +j=1 +n + +(φj|y=0 )xj (1 − φj|y=0 )1−xj + += +j=1 +n + +p(x|y = 1) + +p(xj |y = 1) + += +j=1 +n + +(φj|y=1 )xj (1 − φj|y=1 )1−xj + += +j=1 + +m + +(a) Find the joint likelihood function ℓ(ϕ) = log i=1 p(x(i) , y (i) ; ϕ) in terms of the +model parameters given above. 
Here, ϕ represents the entire set of parameters +{φy , φj|y=0 , φj|y=1 , j = 1, . . . , n}. +(b) Show that the parameters which maximize the likelihood function are the same as + + 4 + +CS229 Problem Set #1 + +those given in the lecture notes; i.e., that +φj|y=0 + += + +φj|y=1 + += + +φy + += + +m +i=1 + +(i) + +1{xj = 1 ∧ y (i) = 0} + +m +(i) = 0} +i=1 1{y +(i) +m +(i) += +i=1 1{xj = 1 ∧ y +m +(i) += 1} +i=1 1{y +m +(i) += 1} +i=1 1{y + +m + +1} + +. + +(c) Consider making a prediction on some new data point x using the most likely class +estimate generated by the naive Bayes algorithm. Show that the hypothesis returned +by naive Bayes is a linear classifier—i.e., if p(y = 0|x) and p(y = 1|x) are the class +probabilities returned by naive Bayes, show that there exists some θ ∈ Rn+1 such +that +1 +≥ 0. +p(y = 1|x) ≥ p(y = 0|x) if and only if θT +x +(Assume θ0 is an intercept term.) +5. Exponential family and the geometric distribution +(a) Consider the geometric distribution parameterized by φ: +p(y; φ) = (1 − φ)y−1 φ, y = 1, 2, 3, . . . . +Show that the geometric distribution is in the exponential family, and give b(y), η, +T (y), and a(η). +(b) Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact +that the mean of a geometric distribution is given by 1/φ. +(c) For a training set {(x(i) , y (i) ); i = 1, . . . , m}, let the log-likelihood of an example +be log p(y (i) |x(i) ; θ). By taking the derivative of the log-likelihood with respect to +θj , derive the stochastic gradient ascent rule for learning using a GLM model with +goemetric responses y and the canonical response function. + + \ No newline at end of file diff --git a/Lectures/aimlcs229/problemset2.txt b/Lectures/aimlcs229/problemset2.txt new file mode 100644 index 0000000..89f6283 --- /dev/null +++ b/Lectures/aimlcs229/problemset2.txt @@ -0,0 +1,206 @@ +1 + +CS229 Problem Set #2 + +CS 229, Public Course +Problem Set #2: Kernels, SVMs, and Theory +1. Kernel ridge regression +In contrast to ordinary least squares which has a cost function +J(θ) = + +1 +2 + +m + +(θT x(i) − y (i) )2 , +i=1 + +we can also add a term that penalizes large weights in θ. In ridge regression, our least +squares cost is regularized by adding a term λ θ 2 , where λ > 0 is a fixed (known) constant +(regularization will be discussed at greater length in an upcoming course lecutre). The ridge +regression cost function is then +J(θ) = + +1 +2 + +m + +(θT x(i) − y (i) )2 + +i=1 + +λ +θ 2. +2 + +(a) Use the vector notation described in class to find a closed-form expreesion for the +value of θ which minimizes the ridge regression cost function. +(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a +high-dimensional (possibly infinite dimensional) space. Using a feature mapping φ, +the ridge regression cost function becomes +J(θ) = + +1 +2 + +m + +(θT φ(x(i) ) − y (i) )2 + +i=1 + +λ +θ 2. +2 + +Making a prediction on a new input xnew would now be done by computing θT φ(xnew ). +Show how we can use the “kernel trick” to obtain a closed form for the prediction +on the new input without ever explicitly computing φ(xnew ). You may assume that +the parameter vector θ can be expressed as a linear combination of the input feature +m +vectors; i.e., θ = i=1 αi φ(x(i) ) for some set of parameters αi . +[Hint: You may find the following identity useful: +(λI + BA)−1 B = B(λI + AB)−1 . 
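A quick numerical sanity check of this identity (not a proof, and not part of the
problem; random Octave matrices with arbitrarily chosen dimensions):

  m = 5; k = 3; lambda = 0.7;
  A = randn(k, m); B = randn(m, k);
  lhs = (lambda*eye(m) + B*A) \ B;    % (lambda I + BA)^-1 B, an m x k matrix
  rhs = B / (lambda*eye(k) + A*B);    % B (lambda I + AB)^-1, also m x k
  disp(norm(lhs - rhs))               % should be ~1e-15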
+If you want, you can try to prove this as well, though this is not required for the +problem.] +2. ℓ2 norm soft margin SVMs +In class, we saw that if our data is not linearly separable, then we need to modify our +support vector machine algorithm by introducing an error margin that must be minimized. +Specifically, the formulation we have looked at is known as the ℓ1 norm soft margin SVM. +In this problem we will consider an alternative method, known as the ℓ2 norm soft margin +SVM. This new algorithm is given by the following optimization problem (notice that the +slack penalties are now squared): +minw,b,ξ +s.t. + +1 +2 + +m + +w 2 + C2 i=1 ξi2 +. +(i) +y (wT x(i) + b) ≥ 1 − ξi , i = 1, . . . , m + + 2 + +CS229 Problem Set #2 + +(a) Notice that we have dropped the ξi ≥ 0 constraint in the ℓ2 problem. Show that these +non-negativity constraints can be removed. That is, show that the optimal value of +the objective will be the same whether or not these constraints are present. +(b) What is the Lagrangian of the ℓ2 soft margin SVM optimization problem? +(c) Minimize the Lagrangian with respect to w, b, and ξ by taking the following gradients: +T +∇w L, ∂L +∂b , and ∇ξ L, and then setting them equal to 0. Here, ξ = [ξ1 , ξ2 , . . . , ξm ] . +(d) What is the dual of the ℓ2 soft margin SVM optimization problem? +3. SVM with Gaussian kernel +Consider the task of training a support vector machine using the Gaussian kernel K(x, z) = +exp(− x − z 2 /τ 2 ). We will show that as long as there are no two identical points in the +training set, we can always find a value for the bandwidth parameter τ such that the SVM +achieves zero training error. +(a) Recall from class that the decision function learned by the support vector machine +can be written as +m + +αi y (i) K(x(i) , x) + b. + +f (x) = +i=1 + +Assume that the training data {(x(1) , y (1) ), . . . , (x(m) , y (m) )} consists of points which +are separated by at least a distance of ǫ; that is, ||x(j) − x(i) || ≥ ǫ for any i = j. +Find values for the set of parameters {α1 , . . . , αm , b} and Gaussian kernel width τ +such that x(i) is correctly classified, for all i = 1, . . . , m. [Hint: Let αi = 1 for all i +and b = 0. Now notice that for y ∈ {−1, +1} the prediction on x(i) will be correct if +|f (x(i) ) − y (i) | < 1, so find a value of τ that satisfies this inequality for all i.] +(b) Suppose we run a SVM with slack variables using the parameter τ you found in part +(a). Will the resulting classifier necessarily obtain zero training error? Why or why +not? A short explanation (without proof) will suffice. +(c) Suppose we run the SMO algorithm to train an SVM with slack variables, under +the conditions stated above, using the value of τ you picked in the previous part, +and using some arbitrary value of C (which you do not know beforehand). Will this +necessarily result in a classifier that achieve zero training error? Why or why not? +Again, a short explanation is sufficient. +4. Naive Bayes and SVMs for Spam Classification +In this question you’ll look into the Naive Bayes and Support Vector Machine algorithms +for a spam classification problem. However, instead of implementing the algorithms yourself, you’ll use a freely available machine learning library. There are many such libraries +available, with different strengths and weaknesses, but for this problem you’ll use the +WEKA machine learning package, available at http://www.cs.waikato.ac.nz/ml/weka/. 
+WEKA implements many standard machine learning algorithms, is written in Java, and +has both a GUI and a command line interface. It is not the best library for very large-scale +data sets, but it is very nice for playing around with many different algorithms on medium +size problems. +You can download and install WEKA by following the instructions given on the website +above. To use it from the command line, you first need to install a java runtime environment, then add the weka.jar file to your CLASSPATH environment variable. Finally, you + + CS229 Problem Set #2 + +3 + +can call WEKA using the command: +java -t -T +For example, to run the Naive Bayes classifier (using the multinomial event model) on our +provided spam data set by running the command: +java weka.classifiers.bayes.NaiveBayesMultinomial -t spam train 1000.arff -T spam test.arff + +The spam classification dataset in the q4/ directory was provided courtesy of Christian +Shelton (cshelton@cs.ucr.edu). Each example corresponds to a particular email, and each +feature correspondes to a particular word. For privacy reasons we have removed the actual +words themselves from the data set, and instead label the features generically as f1, f2, etc. +However, the data set is from a real spam classification task, so the results demonstrate the +performance of these algorithms on a real-world problem. The q4/ directory actually contains several different training files, named spam train 50.arff, spam train 100.arff, +etc (the “.arff” format is the default format by WEKA), each containing the corresponding +number of training examples. There is also a single test set spam test.arff, which is a +hold out set used for evaluating the classifier’s performance. +(a) Run the weka.classifiers.bayes.NaiveBayesMultinomial classifier on the dataset +and report the resulting error rates. Evaluate the performance of the classifier using +each of the different training files (but each time using the same test file, spam test.arff). +Plot the error rate of the classifier versus the number of training examples. +(b) Repeat the previous part, but using the weka.classifiers.functions.SMO classifier, +which implements the SMO algorithm to train an SVM. How does the performance +of the SVM compare to that of Naive Bayes? +5. Uniform convergence +In class we proved that for any finite set of hypotheses H = {h1 , . . . , hk }, if we pick the +ˆ that minimizes the training error on a set of m examples, then with probability +hypothesis h +at least (1 − δ), +1 +2k +ˆ ≤ min ε(hi ) + 2 +log , +ε(h) +i +2m +δ +where ε(hi ) is the generalization error of hypothesis hi . Now consider a special case (often +called the realizable case) where we know, a priori, that there is some hypothesis in our +class H that achieves zero error on the distribution from which the data is drawn. Then +we could obviously just use the above bound with mini ε(hi ) = 0; however, we can prove a +better bound than this. +(a) Consider a learning algorithm which, after looking at m training examples, chooses +ˆ ∈ H that makes zero mistakes on this training data. (By our +some hypothesis h +assumption, there is at least one such hypothesis, possibly more.) Show that with +probability 1 − δ +ˆ ≤ 1 log k . +ε(h) +m +δ +Notice that since we do not have a square root here, this bound is much tighter. [Hint: +Consider the probability that a hypothesis with generalization error greater than γ +makes no mistakes on the training data. 
Instead of the Hoeffding bound, you might +also find the following inequality useful: (1 − γ)m ≤ e−γm .] + + CS229 Problem Set #2 + +4 + +(b) Rewrite the above bound as a sample complexity bound, i.e., in the form: for fixed +ˆ ≤ γ to hold with probability at least (1 − δ), it suffices that m ≥ +δ and γ, for ε(h) +f (k, γ, δ) (i.e., f (·) is some function of k, γ, and δ). + + \ No newline at end of file diff --git a/Lectures/aimlcs229/problemset3.txt b/Lectures/aimlcs229/problemset3.txt new file mode 100644 index 0000000..5f7bccd --- /dev/null +++ b/Lectures/aimlcs229/problemset3.txt @@ -0,0 +1,297 @@ +1 + +CS229 Problem Set #3 + +CS 229, Public Course +Problem Set #3: Learning Theory and Unsupervised Learning +1. Uniform convergence and Model Selection +In this problem, we will prove a bound on the error of a simple model selection procedure. +Let there be a binary classification problem with labels y ∈ {0, 1}, and let H1 ⊆ H2 ⊆ +. . . ⊆ Hk be k different finite hypothesis classes (|Hi | < ∞). Given a dataset S of m iid +training examples, we will divide it into a training set Strain consisting of the first (1 − β)m +examples, and a hold-out cross validation set Scv consisting of the remaining βm examples. +Here, β ∈ (0, 1). +ˆ i = arg minh∈H εˆS +(h) be the hypothesis in Hi with the lowest training error +Let h +train +i +ˆ +(on Strain ). Thus, hi would be the hypothesis returned by training (with empirical risk +minimization) using hypothesis class Hi and dataset Strain . Also let h⋆i = arg minh∈Hi ε(h) +be the hypothesis in Hi with the lowest generalization error. +ˆ i ’s using empirical risk minimization then +Suppose that our algorithm first finds all the h +ˆ 1, . . . , h +ˆ k } with +uses the hold-out cross validation set to select a hypothesis from this the {h +minimum training error. That is, the algorithm will output +ˆ = arg +h + +min + +ˆ 1 ,...,h +ˆk} +h∈{h + +εˆScv (h). + +For this question you will prove the following bound. Let any δ > 0 be fixed. Then with +probability at least 1 − δ, we have that +ˆ ≤ min +ε(h) + +i=1,...,k + +ε(h∗i ) + + +4|Hi | +2 +log +(1 − β)m +δ + ++ + +4k +2 +log +2βm +δ + +ˆi, +(a) Prove that with probability at least 1 − 2δ , for all h +ˆ i )| ≤ +ˆ i ) − εˆS (h +|ε(h +cv + +1 +4k +log . +2βm +δ + +(b) Use part (a) to show that with probability 1 − 2δ , +ˆ ≤ min ε(h +ˆ i) + +ε(h) +i=1,...,k + +2 +4k +log . +βm +δ + +ˆ i ). We know from class that for Hj , with probability 1 − +(c) Let j = arg mini ε(h +ˆ j ) − εˆS +(h⋆j )| ≤ +|ε(h +train + +2 +4|Hj | +log +, ∀hj ∈ Hj . +(1 − β)m +δ + +Use this to prove the final bound given at the beginning of this problem. + +δ +2 + + 2 + +CS229 Problem Set #3 + +2. VC Dimension +Let the input domain of a learning problem be X = R. Give the VC dimension for each +of the following classes of hypotheses. In each case, if you claim that the VC dimension is +d, then you need to show that the hypothesis class can shatter d points, and explain why +there are no d + 1 points it can shatter. +• h(x) = 1{a < x}, with parameter a ∈ R. +• h(x) = 1{a < x < b}, with parameters a, b ∈ R. +• h(x) = 1{a sin x > 0}, with parameter a ∈ R. +• h(x) = 1{sin(x + a) > 0}, with parameter a ∈ R. +3. ℓ1 regularization for least squares +In the previous problem set, we looked at the least squares problem where the objective +function is augmented with an additional regularization term λ θ 22 . 
In this problem we'll consider a similar regularized objective, but this time with a
penalty on the ℓ1 norm of the parameters, λ||θ||_1, where ||θ||_1 is defined as
Σ_i |θ_i|. That is, we want to minimize the objective

    J(θ) = (1/2) Σ_{i=1}^m (θ^T x(i) − y(i))^2 + λ Σ_{i=1}^n |θ_i|.

There has been a great deal of recent interest in ℓ1 regularization, which, as we
will see, has the benefit of outputting sparse solutions (i.e., many components of
the resulting θ are equal to zero).
The ℓ1 regularized least squares problem is more difficult than the unregularized or
ℓ2 regularized case, because the ℓ1 term is not differentiable. However, there have
been many efficient algorithms developed for this problem that work very well in
practice. One very straightforward approach, which we have already seen in class, is
the coordinate descent method. In this problem you'll derive and implement a
coordinate descent algorithm for ℓ1 regularized least squares, and apply it to test
data.
(a) Here we'll derive the coordinate descent update for a given θ_i. Given the X and
y matrices, as defined in the class notes, as well as a parameter vector θ, how can
we adjust θ_i so as to minimize the optimization objective? To answer this question,
we'll rewrite the optimization objective above as

    J(θ) = (1/2) ||Xθ − y||_2^2 + λ||θ||_1
         = (1/2) ||X θ̄ + X_i θ_i − y||_2^2 + λ||θ̄||_1 + λ|θ_i|,

where X_i ∈ R^m denotes the i-th column of X, and θ̄ is equal to θ except with
θ̄_i = 0; all we have done in rewriting the above expression is to make the θ_i term
explicit in the objective. However, this still contains the |θ_i| term, which is
non-differentiable and therefore difficult to optimize. To get around this we make
the observation that θ_i must be either non-negative or non-positive. But if we knew
the sign of θ_i, then |θ_i| becomes just a linear term. That is, we can rewrite the
objective as

    J(θ) = (1/2) ||X θ̄ + X_i θ_i − y||_2^2 + λ||θ̄||_1 + λ s_i θ_i,

where s_i denotes the sign of θ_i, s_i ∈ {−1, 1}. In order to update θ_i, we can
just compute the optimal θ_i for both possible values of s_i (making sure that we
restrict the optimal θ_i to obey the sign restriction we used to solve for it), then
look to see which achieves the best objective value.
For each of the possible values of s_i, compute the resulting optimal value of θ_i.
[Hint: to do this, you can fix s_i in the above equation, then differentiate with
respect to θ_i to find the best value. Finally, clip θ_i so that it lies in the
allowable range — i.e., for s_i = 1, you need to clip θ_i such that θ_i ≥ 0.]
(b) Implement the above coordinate descent algorithm using the updates you found in
the previous part; a sketch of the overall loop structure appears after part (c)
below. We have provided a skeleton theta = l1ls(X,y,lambda) function in the q3/
directory. To implement the coordinate descent algorithm, you should repeatedly
iterate over all the θ_i's, adjusting each as you found above. You can terminate the
process when θ changes by less than 10^−5 after all n of the updates.
(c) Test your implementation on the data provided in the q3/ directory. The [X, y,
theta true] = load data; function will load all the data — the data was generated
by y = X*theta true + 0.05*randn(20,1), but theta true is sparse, so that very few
of the columns of X actually contain relevant features. Run your l1ls.m
implementation on this data set, ranging λ from 0.001 to 10. Comment briefly on how
this algorithm might be used for feature selection.
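The following Octave sketch shows one plausible shape for that loop; the clipped
updates in the body are one way the part (a) derivation can come out, so treat them
as an illustration to check your own derivation against, not as the official
solution.

  function theta = l1ls(X, y, lambda)
    % Coordinate descent for l1-regularized least squares (illustrative sketch).
    n = size(X, 2);
    theta = zeros(n, 1);
    delta = inf;
    while delta > 1e-5
      theta_old = theta;
      for i = 1:n
        theta(i) = 0;
        r = y - X*theta;                         % residual with theta_i zeroed out
        a = X(:,i)' * X(:,i);
        tp = max(0, (X(:,i)'*r - lambda) / a);   % candidate for s_i = +1
        tm = min(0, (X(:,i)'*r + lambda) / a);   % candidate for s_i = -1
        % keep whichever candidate gives the smaller objective
        obj = @(t) 0.5*sum((X*theta + X(:,i)*t - y).^2) ...
                   + lambda*(norm(theta,1) + abs(t));
        if obj(tp) <= obj(tm), theta(i) = tp; else theta(i) = tm; end
      end
      delta = norm(theta - theta_old, 1);
    end
  end

For λ large enough, most coordinates never move off zero, which is exactly the
sparsity that part (c) asks you to comment on.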
4. K-Means Clustering
In this problem you'll implement the K-means clustering algorithm on a synthetic
data set. There is code and data for this problem in the q4/ directory. Run
load 'X.dat'; to load the data file for clustering. Implement the
[clusters, centers] = k means(X, k) function in this directory. As input, this
function takes the m × n data matrix X and the number of clusters k. It should
output an m-element vector, clusters, which indicates which of the clusters each
data point belongs to, and a k × n matrix, centers, which contains the centroids of
each cluster. Run the algorithm on the data provided, with k = 3 and k = 4. Plot the
cluster assignments and centroids for each iteration of the algorithm using the
draw clusters(X, clusters, centroids) function. For each k, be sure to run the
algorithm several times using different initial centroids.

5. The Generalized EM algorithm
When attempting to run the EM algorithm, it may sometimes be difficult to perform
the M step exactly — recall that we often need to implement numerical optimization
to perform the maximization, which can be costly. Therefore, instead of finding the
global maximum of our lower bound on the log-likelihood, an alternative is to just
increase this lower bound a little bit, by taking one step of gradient ascent, for
example. This is commonly known as the Generalized EM (GEM) algorithm.
Put slightly more formally, recall that the M-step of the standard EM algorithm
performs the maximization

    θ := arg max_θ Σ_i Σ_{z(i)} Q_i(z(i)) log [ p(x(i), z(i); θ) / Q_i(z(i)) ].

The GEM algorithm, in contrast, performs the following update in the M-step:

    θ := θ + α ∇_θ Σ_i Σ_{z(i)} Q_i(z(i)) log [ p(x(i), z(i); θ) / Q_i(z(i)) ],

where α is a learning rate which we assume is chosen small enough such that we do
not decrease the objective function when taking this gradient step.
(a) Prove that the GEM algorithm described above converges. To do this, you should
show that the likelihood is monotonically improving, as it does for the EM
algorithm — i.e., show that ℓ(θ(t+1)) ≥ ℓ(θ(t)).
(b) Instead of using the EM algorithm at all, suppose we just want to apply gradient
ascent to maximize the log-likelihood directly. In other words, we are trying to
maximize the (non-convex) function

    ℓ(θ) = Σ_i log Σ_{z(i)} p(x(i), z(i); θ),

so we could simply use the update

    θ := θ + α ∇_θ Σ_i log Σ_{z(i)} p(x(i), z(i); θ).

Show that this procedure in fact gives the same update as the GEM algorithm
described above.

 \ No newline at end of file diff --git a/Lectures/aimlcs229/problemset4.txt b/Lectures/aimlcs229/problemset4.txt new file mode 100644 index 0000000..9df78b6 --- /dev/null +++ b/Lectures/aimlcs229/problemset4.txt @@ -0,0 +1,312 @@

CS229 Problem Set #4

CS 229, Public Course
Problem Set #4: Unsupervised Learning and Reinforcement Learning

1. EM for supervised learning
In class we applied EM to the unsupervised learning setting. In particular, we
represented p(x) by marginalizing over a latent random variable

    p(x) = Σ_z p(x, z) = Σ_z p(x|z) p(z).

However, EM can also be applied to the supervised learning setting, and in this
problem we discuss a "mixture of linear regressors" model; this is an instance of
what is often called the Hierarchical Mixture of Experts model.
We want to represent p(y|x), x ∈ Rn and y ∈ R, +and we do so by again introducing a discrete latent random variable +p(y|x) = + +p(y, z|x) = +z + +p(y|x, z)p(z|x). +z + +For simplicity we’ll assume that z is binary valued, that p(y|x, z) is a Gaussian density, +and that p(z|x) is given by a logistic regression model. More formally += g(φT x)z (1 − g(φT x))1−z +−(y − θiT x)2 +1 +exp +p(y|x, z = i; θi ) = √ +2σ 2 +2πσ +p(z|x; φ) + +i = 1, 2 + +where σ is a known parameter and φ, θ0 , θ1 ∈ Rn are parameters of the model (here we +use the subscript on θ to denote two different parameter vectors, not to index a particular +entry in these vectors). +Intuitively, the process behind model can be thought of as follows. Given a data point x, +we first determine whether the data point belongs to one of two hidden classes z = 0 or +z = 1, using a logistic regression model. We then determine y as a linear function of x +(different linear functions for different values of z) plus Gaussian noise, as in the standard +linear regression model. For example, the following data set could be well-represented by +the model, but not by standard linear regression. + + 2 + +CS229 Problem Set #4 + +(a) Suppose x, y, and z are all observed, so that we obtain a training set +{(x(1) , y (1) , z (1) ), . . . , (x(m) , y (m) , z (m) )}. Write the log-likelihood of the parameters, +and derive the maximum likelihood estimates for φ, θ0 , and θ1 . Note that because +p(z|x) is a logistic regression model, there will not exist a closed form estimate of φ. +In this case, derive the gradient and the Hessian of the likelihood with respect to φ; +in practice, these quantities can be used to numerically compute the ML esimtate. +(b) Now suppose z is a latent (unobserved) random variable. Write the log-likelihood of +the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly +specify the E-step and M-step (again, the M-step will require a numerical solution, +so find the appropriate gradients and Hessians). +2. Factor Analysis and PCA +In this problem we look at the relationship between two unsupervised learning algorithms +we discussed in class: Factor Analysis and Principle Component Analysis. +Consider the following joint distribution over (x, z) where z ∈ Rk is a latent random +variable +z +x|z + +∼ +∼ + +N (0, I) +N (U z, σ 2 I). + +where U ∈ Rn×k is a model parameters and σ 2 is assumed to be a known constant. This +model is often called Probabilistic PCA. Note that this is nearly identical to the factor +analysis model except we assume that the variance of x|z is a known scaled identity matrix +rather than the diagonal parameter matrix, Φ, and we do not add an additional µ term to +the mean (though this last difference is just for simplicity of presentation). However, as +we will see, it turns out that as σ 2 → 0, this model is equivalent to PCA. +For simplicity, you can assume for the remainder of the problem that k = 1, i.e., that U is +a column vector in Rn . + +(a) Use the rules for manipulating Gaussian distributions to determine the joint distribution over (x, z) and the conditional distribution of z|x. [Hint: for later parts of +this problem, it will help significantly if you simplify your soluting for the conditional +distribution using the identity we first mentioned in problem set #1: (λI +BA)−1 B = +B(λI + AB)−1 .] +(b) Using these distributions, derive an EM algorithm for the model. Clearly state the +E-step and the M-step of the algorithm. 
+(c) As σ 2 → 0, show that if the EM algorithm convergences to a parameter vector U ⋆ +(and such convergence is guarenteed by the argument presented in class), then U ⋆ +m +1 +(i) (i) T +— i.e., +must be an eigenvector of the sample covariance matrix Σ = m +i=1 x x +U ⋆ must satisfy +λU ⋆ = ΣU ⋆ . +[Hint: When σ 2 → 0, Σz|x → 0, so the E step only needs to compute the means +µz|x and not the variances. Let w ∈ Rm be a vector containing all these means, +wi = µz(i) |x(i) , and show that the E step and M step can be expressed as +w= + +XU +, +UT U + +U= + +XT w +wT w + + CS229 Problem Set #4 + +3 + +respectively. Finally, show that if U doesn’t change after this update, it must satisfy +the eigenvector equation shown above. ] +3. PCA and ICA for Natural Images +In this problem we’ll apply Principal Component Analysis and Independent Component +Analysis to images patches collected from “natural” image scenes (pictures of leaves, grass, +etc). This is one of the classical applications of the ICA algorithm, and sparked a great +deal of interest in the algorithm; it was observed that the bases recovered by ICA closely +resemble image filters present in the first layer of the visual cortex. +The q3/ directory contains the data and several useful pieces of code for this problem. The +raw images are stored in the images/ subdirectory, though you will not need to work with +these directly, since we provide code for loading and normalizing the images. +Calling the function [X ica, X pca] = load images; will load the images, break them +into 16x16 images patches, and place all these patches into the columns of the matrices X ica and X pca. We create two different data sets for PCA and ICA because the +algorithms require slightly different methods of preprocessing the data.1 +For this problem you’ll implement the ica.m and pca.m functions, using the PCA and +ICA algorithms described in the class notes. While the PCA implementation should be +straightforward, getting a good implementation of ICA can be a bit trickier. Here is some +general advice to getting a good implementation on this data set: +• Picking a good learning rate is important. In our experiments we used α = 0.0005 on +this data set. +• Batch gradient descent doesn’t work well for ICA (this has to do with the fact that +ICA objective function is not concave), but the pure stochastic gradient described in +the notes can be slow (There are about 20,000 16x16 images patches in the data set, +so one pass over the data using the stochastic gradient rule described in the notes +requires inverting the 256x256 W matrix 20,000 times). Instead, a good compromise +is to use a hybrid stochastic/batch gradient descent where we calculate the gradient +with respect to several examples at a time (100 worked well for us), and use this to +update W . Our implementation makes 10 total passes over the entire data set. +• It is a good idea to randomize the order of the examples presented to stochastic +gradient descent before each pass over the data. +• Vectorize your Matlab code as much as possible. For general examples of how to do +this, look at the Matlab review session. +For reference, computing the ICA W matrix for the entire set of image patches takes about +5 minutes on a 1.6 Ghz laptop using our implementation. 
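To make the PCA half concrete, here is a minimal Octave sketch of what a pca.m along
these lines can look like, assuming the patches arrive one per column and have
already been preprocessed by load images (the skeleton file's exact interface may
differ):

  function U = pca(X)
    % Principal components of the patch matrix X (one patch per column).
    m = size(X, 2);                 % number of patches
    Sigma = (X * X') / m;           % sample covariance (data assumed zero-mean)
    [U, D] = eig(Sigma);            % eigendecomposition of Sigma
    [vals, order] = sort(diag(D), 'descend');
    U = U(:, order);                % columns = principal components, largest first
  end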
After you've learned the U matrix for PCA (the columns of U should contain the
principal components of the data) and the W matrix of ICA, you can plot the basis
functions using the plot ica bases(W); and plot pca bases(U); functions we have
provided. Comment briefly on the difference between the two sets of basis functions.
1 Recall that the first step of performing PCA is to subtract the mean and normalize
the variance of the features. For the image data we're using, the preprocessing step
for the ICA algorithm is slightly different, though the precise mechanism and
justification is not important for the sake of this problem. Those who are curious
about the details should read Bell and Sejnowski's paper "The 'Independent
Components' of Natural Scenes are Edge Filters," which provided the basis for the
implementation we use in this problem.

4. Convergence of Policy Iteration
In this problem we show that the Policy Iteration algorithm, described in the
lecture notes, is guaranteed to find the optimal policy for an MDP. First, define
B^π to be the Bellman operator for policy π, defined as follows: if V' = B^π(V),
then

    V'(s) = R(s) + γ Σ_{s'∈S} P_{sπ(s)}(s') V(s').

(a) Prove that if V1(s) ≤ V2(s) for all s ∈ S, then B^π(V1)(s) ≤ B^π(V2)(s) for all
s ∈ S.
(b) Prove that for any V,

    ||B^π(V) − V^π||_∞ ≤ γ ||V − V^π||_∞,

where ||V||_∞ = max_{s∈S} |V(s)|. Intuitively, this means that applying the Bellman
operator B^π to any value function V brings that value function "closer" to the
value function for π, V^π. This also means that applying B^π repeatedly (an infinite
number of times)

    B^π(B^π(. . . B^π(V) . . .))

will result in the value function V^π (a little bit more is needed to make this
completely formal, but we won't worry about that here).
[Hint: Use the fact that for any α, x ∈ R^n, if Σ_i α_i = 1 and α_i ≥ 0, then
Σ_i α_i x_i ≤ max_i x_i.]
(c) Now suppose that we have some policy π, and use Policy Iteration to choose a new
policy π' according to

    π'(s) = arg max_{a∈A} Σ_{s'∈S} P_{sa}(s') V^π(s').

Show that this policy will never perform worse than the previous one — i.e., show
that for all s ∈ S, V^π(s) ≤ V^π'(s).
[Hint: First show that V^π(s) ≤ B^π'(V^π)(s), then use the preceding exercises to
show that B^π'(V^π)(s) ≤ V^π'(s).]
(d) Use the preceding exercises to show that policy iteration will eventually
converge (i.e., produce a policy π' = π). Furthermore, show that it must converge to
the optimal policy π⋆. For the latter part, you may use the property that if some
value function satisfies

    V(s) = R(s) + γ max_{a∈A} Σ_{s'∈S} P_{sa}(s') V(s'),

then V = V⋆.

5. Reinforcement Learning: The Mountain Car
In this problem you will implement the Q-Learning reinforcement learning algorithm
described in class on a standard control domain known as the Mountain Car.2 The
Mountain Car domain simulates a car trying to drive up a hill, as shown in the
figure below.
2 The dynamics of this domain were taken from Sutton and Barto, 1998.

[Figure: the Mountain Car domain, a car on a hillside whose horizontal position
ranges from about −1.2 to 0.6.]

All states except those at the top of the hill have a constant reward R(s) = −1,
while the goal state at the hilltop has reward R(s) = 0; thus an optimal agent will
try to get to the top of the hill as fast as possible (when the car reaches the top
of the hill, the episode is over, and the car is reset to its initial position).
However, when starting at the bottom of the hill, the car does not have enough power
to reach the top by driving forward, so it must first accelerate backwards, building
up enough momentum to reach the top of the hill. This strategy of moving away from
the goal in order to reach the goal makes the problem difficult for many classical
control algorithms.
As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each
state and action. These Q-values are useful because, in order to select an action in
state s, we only need to check to see which Q-value is greatest. That is, in state s
we take the action

    arg max_{a∈A} Q(s, a).

The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an
agent is in state s, takes action a, then ends up in state s', Q-learning will
update Q(s, a) by

    Q(s, a) := (1 − α) Q(s, a) + α ( R(s') + γ max_{a'∈A} Q(s', a') ).

At each time, your implementation of Q-learning can execute the greedy policy
π(s) = arg max_{a∈A} Q(s, a).
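Before turning to the starter code, here is a minimal Octave sketch of one episode
of that update rule. Only the mountain car call matches the provided interface; the
action set, the reward built from absorb, the constants alpha and gamma, and the
initialization of Q, x, s, and a are all illustrative assumptions:

  alpha = 0.1; gamma = 0.99;             % learning rate and discount (assumed values)
  actions = [-1 1];                      % assumed action set: reverse / forward
  % Q, x, s, and a are assumed already initialized (e.g., Q = zeros(...), a = 1)
  absorb = 0;
  while ~absorb
    [x, s_next, absorb] = mountain_car(x, actions(a));  % one control cycle
    R = -1 * ~absorb;                    % assumed: reward -1 until the hilltop
    Q(s, a) = (1 - alpha)*Q(s, a) + alpha*(R + gamma*max(Q(s_next, :)));
    [ignore, a] = max(Q(s_next, :));     % greedy action for the next step
    s = s_next;
  end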
+ + 5 + +CS229 Problem Set #4 + +0.6 + +0.4 + +0.2 + +0 + +−0.2 + +−0.4 + +−0.6 + +−1.2 + +−1 + +−0.8 + +−0.6 + +−0.4 + +−0.2 + +0 + +0.2 + +0.4 + +0.6 + +All states except those at the top of the hill have a constant reward R(s) = −1, while the +goal state at the hilltop has reward R(s) = 0; thus an optimal agent will try to get to the +top of the hill as fast as possible (when the car reaches the top of the hill, the episode is +over, and the car is reset to its initial position). However, when starting at the bottom +of the hill, the car does not have enough power to reach the top by driving forward, so +it must first accerlaterate accelerate backwards, building up enough momentum to reach +the top of the hill. This strategy of moving away from the goal in order to reach the goal +makes the problem difficult for many classical control algorithms. +As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each state and +action. These Q-values are useful because, in order to select an action in state s, we only +need to check to see which Q-value is greatest. That is, in state s we take the action +arg max Q(s, a). +a∈A + +The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an agent is in +state s, takes action a, then ends up in state s′ , Q-learning will update Q(s, a) by +Q(s, a) = (1 − α)Q(s, a) + γ(R(s′ ) + γ max +Q(s′ , a′ ). +′ +a ∈A + +At each time, your implementation of Q-learning can execute the greedy policy π(s) = +arg maxa∈A Q(s, a) +Implement the [q, steps per episode] = qlearning(episodes) function in the q5/ +directory. As input, the function takes the total number of episodes (each episode starts +with the car at the bottom of the hill, and lasts until the car reaches the top), and outputs +a matrix of the Q-values and a vector indicating how many steps it took before the car was +able to reach the top of the hill. You should use the [x, s, absorb] = mountain car(x, +actions(a)) function to simulate one control cycle for the task — the x variable describes +the true (continuous) state of the system, whereas the s variable describes the discrete +index of the state, which you’ll use to build the Q values. +Plot a graph showing the average number of steps before the car reaches the top of the +hill versus the episode number (there is quite a bit of variation in this quantity, so you will +probably want to average these over a large number of episodes, as this will give you a +better idea of how the number of steps before reaching the hilltop is decreasing). You can +also visualize your resulting controller by calling the draw mountain car(q) function. + + \ No newline at end of file diff --git a/Lectures/aimlcs229/projectGuidelines.txt b/Lectures/aimlcs229/projectGuidelines.txt new file mode 100644 index 0000000..2dbee4a --- /dev/null +++ b/Lectures/aimlcs229/projectGuidelines.txt @@ -0,0 +1,169 @@ +CS229 Final Project Guidelines + +1 + +CS 229, Autumn 2007 +Final Project Guidelines and Suggestions +1 + +Project overview + +One of CS229’s goals is to prepare you to (i) apply state-of-the-art machine learning algorithms +to an application, and (ii) do research in machine learning. The class’s final project will offer +you an opportunity to do exactly this. +The important dates for the CS229 project are: +• Proposals: Due at noon on Friday, 10/19. +• Milestone: Due at noon on Friday, 11/16. +• Poster presentations: Morning of Wednesday, 12/12. +• Final writeup: Due at 11:59pm on Friday, 12/14 (no late days). 
+Projects can be done in teams of up to three students. If you have a project of such grandiose +scope and ambition that it cannot be done by a team of only three persons, you can propose +doing a project in a team of four. + +2 + +Project topics + +Your first task is to pick a project topic. If you’re looking for project ideas, please come to either +Prof. Ng or the TAs’ office hours, and we’d be happy to brainstorm and suggest some project +ideas. In the meantime, here are some suggestions that might also help. +Most students do one of three kinds of projects: +1. Application project. This is by far the most common: Pick an application that interests +you, and explore how best to apply learning algorithms to solve it. +2. Algorithmic project. Pick a problem or family of problems, and develop a new learning +algorithm, or a novel variant of an existing algorithm, to solve it. +3. Theoretical project. Prove some interesting/non-trivial properties of a new or an existing learning algorithm. (This is often quite difficult, and so very few, if any, projects will +try to do this.) +Some projects will also combine elements of applications and algorithms and theory. +Many fantastic class projects come from students picking either an application that they’re +interested in, or picking some sub-field of machine learning that they want to explore more, and +working on that as their project. If you haven’t worked on a research project before but would +like to, you can also use this as an opportunity to try your hand at it. (Just be sure to ask us +for help if you’re uncertain how to best get started.) + + CS229 Final Project Guidelines + +2 + +Alternatively, if you’re already working on a research project that machine learning might be +applicable to, then working out how to apply learning to it will often make a very good project +topic. Similarly, if you currently work in industry and have an application on which machine +learning might help, that could also make a great project. +A very good CS229 project will comprise a publishable (or nearly-publishable) piece of work. +Each year, some number of students continue working on their projects after completing CS229, +and submit their work to a conference or journal. +So, for inspiration, you might also look at some recent machine learning research papers. Two +of the main machine learning conferences are ICML and NIPS. You can also find papers from recent ICML conferences online (http://www.icml2006.org/icml2006/technical/accepted.html, +http://oregonstate.edu/conferences/icml2007/paperlist.html). All NIPS papers are online, at http://books.nips.cc/. Finally, to see a list of last year’s class projects, you can go +to http://www.stanford.edu/class/cs229/projects2006.html . +Projects will be evaluated based on:1 +• The technical quality of the work. (I.e., Does the technical material make sense? Are the +things tried reasonable? Are the proposed algorithms or applications clever and interesting? +Do the authors convey novel insight about the problem and/or algorithms?) +• Significance. (Did the authors choose an interesting or a “real” problem to work on, or +only a small toy problem? Is this work likely to be useful and/or have impact?) +• The novelty of the work, and the clarity of the writeup. +Lastly, a few words of advice: Many of the best class projects come from students working +on topics that they’re excited about. So, pick something that you personally can get excited and +passionate about! 
In addition, don’t be timid, and when in doubt go for whatever’s the more +ambitious option. Finally, if you’re not sure what would or would not make a good project, +please feel strongly encouraged to either email us or come to office hours to talk about project +ideas. + +3 + +Project submission logistics + +This section contains the detailed instructions for submitting different parts of your project. +You probably do not need to read any of this in great detail until nearly the due date of the +submissions. + +3.1 + +Project proposals + +Your proposal should be a normal (plain ASCII) email, giving the title of the project, the +full names of all of your team members, and about a 300-500 word description of what you plan +to do. Please send your proposal as a normal email and not as an attachment, or use any other +document format (such as PDF or MS-Word). +1 Don’t overthink these criteria, nor worry too much if you’re not sure that you can do well on all of them. +Instead, just think of them as an “ideal” that you should aspire to, particularly if your goal is to do publishable +research work. + + CS229 Final Project Guidelines + +3.2 + +3 + +Milestone + +The project milestone is due at noon on Friday, 11/16, which is roughly halfway between the +proposal and the final project due dates. Your milestone report should describe what you’ve +accomplished so far, and very briefly say what else you plan to do. +The milestone will help you make sure you’re on-track. You should write it as if it’s an “early +draft” of what will turn into your final project. Specifically, you can write it as if you’re writing +the first few pages of your final project report, so that you can re-use most of the milestone +text in your final report. Please write the milestone (and final report) keeping in mind that the +intended audience is Prof. Ng and the TAs. Thus, for example, you should not spend two pages +explaining what logistic regression is. +Submission instructions: Your milestone report should be at most 3 pages long. Please send +the milestone as an email attachment, to +. Please submit +your milestone in PDF format. In particular, we do not accept MS-Word, OpenOffice, +PostScript, or any other document format. You should name your PDF file according to the format “yourConcatenatedLastNames-ProjectTitle.pdf”. For example, if the project partners +are John Doe and Jane Smith, and your project title is “Learning to Recognize People,” then +name your PDF file DoeSmith-LearningToRecognizePeople.pdf. +In addition, in the body of the email (which should be plain ASCII, and not HTML), list +the full names of all the team members on the first line, and state the full title of your project +on the second line. (The rest of the email can be anything you like.) For example, +From: Jane.Smith@stanford.edu +To: +Cc: John.Doe@stanford.edu +Subject: Project milestone +John Doe and Jane Smith +Learning to recognize people +Please follow these submission instructions exactly. Failure to do so may result in our email +system not receiving your file. + +3.3 + +Poster presentations + +The class projects will be presented at a poster presentation on Wednesday, 12/12. Each team +should prepare a poster, and be prepared to give a very short explanation, in front of the poster, +about their work. At the poster session, you’ll also have an opportunity to see what everyone +else did for their projects. (SCPD students living outside the bay area are exempt from this.) 
We'll supply poster-boards and easels for displaying the posters.

3.4 Final writeup

Final project writeups are due at 11:59pm on Friday, 12/14. Late days cannot be used
for the final writeup. Final project writeups can be at most 5 pages long. Apart
from the page limit, please follow the same submission instructions (such as
filename and format of email) as the milestone.
If you did this work in collaboration with someone else, or if someone else (such as
another professor) had advised you on this work, your writeup must fully acknowledge
their contributions.
After the class, we will post all the final writeups online so that you can read
about each others' work. If you do not want your writeup to be posted online, then
please let us know at least a week in advance of the final writeup submission
deadline, and we'll give you a different email address to which you may send your
writeup.

4 Miscellany

If, after CS229, you want to submit your work to a machine learning conference, the
ICML 2008 deadline will be in early February next year (see
http://icml2008.cs.helsinki.fi/ for details), and the NIPS deadline is usually in
June (http://www.nips.cc/). Of course, depending on the topic of your project, other
non-machine learning conferences may also be more appropriate.

 \ No newline at end of file diff --git a/Lectures/aimlcs229/ps1_solution.txt b/Lectures/aimlcs229/ps1_solution.txt new file mode 100644 index 0000000..d7f046c --- /dev/null +++ b/Lectures/aimlcs229/ps1_solution.txt @@ -0,0 +1,1001 @@

CS229 Problem Set #1 Solutions

CS 229, Public Course
Problem Set #1 Solutions: Supervised Learning

1. Newton's method for computing least squares
In this problem, we will prove that if we use Newton's method to solve the least
squares optimization problem, then we only need one iteration to converge to θ∗.
(a) Find the Hessian of the cost function J(θ) = (1/2) Σ_{i=1}^m (θ^T x(i) − y(i))^2.
Answer: As shown in the class notes,

    ∂J(θ)/∂θ_j = Σ_{i=1}^m (θ^T x(i) − y(i)) x(i)_j.

So

    ∂^2 J(θ)/∂θ_j ∂θ_k = Σ_{i=1}^m ∂/∂θ_k [ (θ^T x(i) − y(i)) x(i)_j ]
                       = Σ_{i=1}^m x(i)_j x(i)_k = (X^T X)_{jk}.

Therefore, the Hessian of J(θ) is H = X^T X. This can also be derived by simply
applying rules from the lecture notes on Linear Algebra.
(b) Show that the first iteration of Newton's method gives us θ⋆ = (X^T X)^{−1} X^T y,
the solution to our least squares problem.
Answer: Given any θ(0), Newton's method finds θ(1) according to

    θ(1) = θ(0) − H^{−1} ∇_θ J(θ(0))
         = θ(0) − (X^T X)^{−1} (X^T X θ(0) − X^T y)
         = θ(0) − θ(0) + (X^T X)^{−1} X^T y
         = (X^T X)^{−1} X^T y.

Therefore, no matter what θ(0) we pick, Newton's method always finds θ⋆ after one
iteration.
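A short numerical check of this fact, with made-up random data:

  m = 20; n = 4;
  X = randn(m, n); y = randn(m, 1);
  theta0 = randn(n, 1);                             % an arbitrary starting point
  theta1 = theta0 - (X'*X) \ (X'*(X*theta0 - y));   % one Newton step
  disp(norm(theta1 - (X'*X) \ (X'*y)))              % ~1e-15: the normal-equations solution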
2. Locally-weighted logistic regression

In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize

    ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^m w^{(i)} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ].

The −(λ/2) θ^T θ here is a regularization term, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001.

Using this definition, the gradient of ℓ(θ) is given by

    ∇_θ ℓ(θ) = X^T z − λθ,

where z ∈ R^m is defined by

    z_i = w^{(i)} (y^{(i)} − h_θ(x^{(i)})),

and the Hessian is given by

    H = X^T D X − λI,

where D ∈ R^{m×m} is a diagonal matrix with

    D_ii = −w^{(i)} h_θ(x^{(i)}) (1 − h_θ(x^{(i)})).

For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.

Given a query point x, we compute the weights

    w^{(i)} = exp( −||x − x^{(i)}||² / (2τ²) ).

Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.

(a) Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point x, and use this to predict the class of x.

The q2/ directory contains data and code for this problem. You should implement the y = lwlr(X_train, y_train, x, tau) function in the lwlr.m file. This function takes as input the training set (the X_train and y_train matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights w^{(i)} for each training example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and finally 3) output y = 1{h_θ(x) > 0.5} as the prediction.

We provide two additional functions that might help. The [X_train, y_train] = load_data; function will load the matrices from files in the data/ folder. The function plot_lwlr(X_train, y_train, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increasing it to at least 200 to get a better idea of the decision boundary.

Answer: Our implementation of lwlr.m:

function y = lwlr(X_train, y_train, x, tau)

m = size(X_train,1);
n = size(X_train,2);
theta = zeros(n,1);

% compute weights (note the tau^2 in the denominator, matching the formula above)
w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2*tau^2));

% perform Newton's method
g = ones(n,1);
while (norm(g) > 1e-6)
  h = 1 ./ (1 + exp(-X_train * theta));
  g = X_train' * (w.*(y_train - h)) - 1e-4*theta;              % gradient of l(theta)
  H = -X_train' * diag(w.*h.*(1-h)) * X_train - 1e-4*eye(n);   % Hessian of l(theta)
  theta = theta - H \ g;
end

% return predicted y
y = double(x'*theta > 0);

(b) Evaluate the system with a variety of different bandwidth parameters τ. In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

Answer: These are the resulting decision boundaries, for the different values of τ.

[Figure: six panels showing the decision boundary for tau = 0.01, 0.05, 0.1, 0.5, 1.0, and 5.0]

For smaller τ, the classifier appears to overfit the data set, obtaining zero training error, but outputting a sporadic-looking decision boundary. As τ grows, the resulting decision boundary becomes smoother, eventually converging (in the limit as τ → ∞) to the unweighted logistic regression solution.
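For reference, a driver script for this part might look like the following sketch (our own illustration; it assumes the provided load_data and plot_lwlr helpers described above and your lwlr implementation):

% Plot the locally weighted classifier for several bandwidths.
[X_train, y_train] = load_data;          % provided helper
for tau = [0.01 0.05 0.1 0.5 1.0 5.0],
  figure;
  plot_lwlr(X_train, y_train, tau, 50);  % resolution = 50 while debugging
  title(sprintf('tau = %g', tau));
end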
3. Multivariate least squares

So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:

    {(x^{(i)}, y^{(i)}), i = 1, . . . , m},  x^{(i)} ∈ R^n, y^{(i)} ∈ R^p.

Thus for each training example, y^{(i)} is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in

    y = Θ^T x,

where Θ ∈ R^{n×p}.

(a) The cost function for this case is

    J(Θ) = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^{(i)})_j − y_j^{(i)} )².

Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start with the m × n design matrix X whose rows are (x^{(1)})^T, . . . , (x^{(m)})^T, and the m × p target matrix Y whose rows are (y^{(1)})^T, . . . , (y^{(m)})^T, and then work out how to express J(Θ) in terms of these matrices.]

Answer: The objective function can be expressed as

    J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) ).

To see this, note that

    (1/2) tr( (XΘ − Y)^T (XΘ − Y) ) = (1/2) Σ_i ( (XΘ − Y)^T (XΘ − Y) )_{ii}
                                     = (1/2) Σ_i Σ_j (XΘ − Y)_{ij}²
                                     = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^{(i)})_j − y_j^{(i)} )²
                                     = J(Θ).

(b) Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of the normal equations for the multivariate case.

Answer: First we take the gradient of J(Θ) with respect to Θ:

    ∇_Θ J(Θ) = ∇_Θ (1/2) tr( (XΘ − Y)^T (XΘ − Y) )
             = ∇_Θ (1/2) [ tr(Θ^T X^T X Θ) − tr(Θ^T X^T Y) − tr(Y^T X Θ) + tr(Y^T Y) ]
             = (1/2) ∇_Θ [ tr(Θ^T X^T X Θ) − 2 tr(Y^T X Θ) + tr(Y^T Y) ]
             = (1/2) [ X^T X Θ + X^T X Θ − 2 X^T Y ]
             = X^T X Θ − X^T Y.

Setting this expression to zero we obtain

    Θ = (X^T X)^{−1} X^T Y.

This looks very similar to the closed form solution in the univariate case, except now Y is an m × p matrix, so Θ is also a matrix, of size n × p.

(c) Suppose instead of considering the multivariate vectors y^{(i)} all at once, we instead compute each variable y_j^{(i)} separately for each j = 1, . . . , p. In this case, we have p individual linear models, of the form

    y_j^{(i)} = θ_j^T x^{(i)},  j = 1, . . . , p.

(So here, each θ_j ∈ R^n.) How do the parameters from these p independent least squares problems compare to the multivariate solution?

Answer: This time, we construct a set of vectors

    y_j = ( y_j^{(1)}, y_j^{(2)}, . . . , y_j^{(m)} )^T,  j = 1, . . . , p.

Then our j-th linear model can be solved by the least squares solution

    θ_j = (X^T X)^{−1} X^T y_j.

If we line up our θ_j, we see that we have the following equation:

    [θ_1 θ_2 · · · θ_p] = [ (X^T X)^{−1} X^T y_1  (X^T X)^{−1} X^T y_2  · · ·  (X^T X)^{−1} X^T y_p ]
                        = (X^T X)^{−1} X^T [y_1 y_2 · · · y_p]
                        = (X^T X)^{−1} X^T Y
                        = Θ.

Thus, our p individual least squares problems give the exact same solution as the multivariate least squares.
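A quick numerical check of this equivalence (our own illustration on synthetic data, not part of the original hand-out):

% Check: multivariate least squares == p independent least-squares fits.
m = 100; n = 4; p = 3;
X = randn(m, n);
Y = randn(m, p);
Theta = (X' * X) \ (X' * Y);                      % multivariate solution, n x p
Theta_sep = zeros(n, p);
for j = 1:p,
  Theta_sep(:, j) = (X' * X) \ (X' * Y(:, j));    % fit the j-th output alone
end
disp(norm(Theta - Theta_sep));                    % ~0 up to rounding error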
4. Naive Bayes

In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features x_j, j = 1, . . . , n, to our model are discrete, binary-valued variables, so x_j ∈ {0, 1}. We call x = [x_1 x_2 · · · x_n]^T the input vector. For each training example, our output target is a single binary value y ∈ {0, 1}. Our model is then parameterized by φ_{j|y=0} = p(x_j = 1|y = 0), φ_{j|y=1} = p(x_j = 1|y = 1), and φ_y = p(y = 1). We model the joint distribution of (x, y) according to

    p(y) = (φ_y)^y (1 − φ_y)^{1−y}
    p(x|y = 0) = Π_{j=1}^n p(x_j|y = 0) = Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j}
    p(x|y = 1) = Π_{j=1}^n p(x_j|y = 1) = Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j}.

(a) Find the joint likelihood function ℓ(ϕ) = log Π_{i=1}^m p(x^{(i)}, y^{(i)}; ϕ) in terms of the model parameters given above. Here, ϕ represents the entire set of parameters {φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, . . . , n}.

Answer:

    ℓ(ϕ) = log Π_{i=1}^m p(x^{(i)}, y^{(i)}; ϕ)
         = log Π_{i=1}^m p(x^{(i)}|y^{(i)}; ϕ) p(y^{(i)}; ϕ)
         = log Π_{i=1}^m [ Π_{j=1}^n p(x_j^{(i)}|y^{(i)}; ϕ) ] p(y^{(i)}; ϕ)
         = Σ_{i=1}^m [ log p(y^{(i)}; ϕ) + Σ_{j=1}^n log p(x_j^{(i)}|y^{(i)}; ϕ) ]
         = Σ_{i=1}^m [ y^{(i)} log φ_y + (1 − y^{(i)}) log(1 − φ_y)
                       + Σ_{j=1}^n ( x_j^{(i)} log φ_{j|y^{(i)}} + (1 − x_j^{(i)}) log(1 − φ_{j|y^{(i)}}) ) ].

(b) Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that

    φ_{j|y=0} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} / Σ_{i=1}^m 1{y^{(i)} = 0}
    φ_{j|y=1} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 1} / Σ_{i=1}^m 1{y^{(i)} = 1}
    φ_y = ( Σ_{i=1}^m 1{y^{(i)} = 1} ) / m.

Answer: The only terms in ℓ(ϕ) which have non-zero gradient with respect to φ_{j|y=0} are those which include φ_{j|y^{(i)}}. Therefore,

    ∇_{φ_{j|y=0}} ℓ(ϕ) = ∇_{φ_{j|y=0}} Σ_{i=1}^m ( x_j^{(i)} log φ_{j|y^{(i)}} + (1 − x_j^{(i)}) log(1 − φ_{j|y^{(i)}}) )
        = ∇_{φ_{j|y=0}} Σ_{i=1}^m ( x_j^{(i)} log(φ_{j|y=0}) 1{y^{(i)} = 0} + (1 − x_j^{(i)}) log(1 − φ_{j|y=0}) 1{y^{(i)} = 0} )
        = Σ_{i=1}^m ( x_j^{(i)} (1/φ_{j|y=0}) 1{y^{(i)} = 0} − (1 − x_j^{(i)}) (1/(1 − φ_{j|y=0})) 1{y^{(i)} = 0} ).

Setting ∇_{φ_{j|y=0}} ℓ(ϕ) = 0 gives

    0 = Σ_{i=1}^m ( x_j^{(i)} (1 − φ_{j|y=0}) 1{y^{(i)} = 0} − (1 − x_j^{(i)}) φ_{j|y=0} 1{y^{(i)} = 0} )
      = Σ_{i=1}^m ( x_j^{(i)} − φ_{j|y=0} ) 1{y^{(i)} = 0}
      = Σ_{i=1}^m x_j^{(i)} · 1{y^{(i)} = 0} − φ_{j|y=0} Σ_{i=1}^m 1{y^{(i)} = 0}
      = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} − φ_{j|y=0} Σ_{i=1}^m 1{y^{(i)} = 0}.

We then arrive at our desired result

    φ_{j|y=0} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} / Σ_{i=1}^m 1{y^{(i)} = 0}.

The solution for φ_{j|y=1} proceeds in the identical manner.

To solve for φ_y,

    ∇_{φ_y} ℓ(ϕ) = ∇_{φ_y} Σ_{i=1}^m ( y^{(i)} log φ_y + (1 − y^{(i)}) log(1 − φ_y) )
                 = Σ_{i=1}^m ( y^{(i)} (1/φ_y) − (1 − y^{(i)}) (1/(1 − φ_y)) ).

Then setting ∇_{φ_y} ℓ(ϕ) = 0 gives us

    0 = Σ_{i=1}^m ( y^{(i)} (1 − φ_y) − (1 − y^{(i)}) φ_y ) = Σ_{i=1}^m y^{(i)} − Σ_{i=1}^m φ_y.

Therefore,

    φ_y = ( Σ_{i=1}^m 1{y^{(i)} = 1} ) / m.
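Since these estimates are just indicator counts, they are easy to compute in vectorized form. Here is a minimal sketch (our own illustration, with hypothetical variables: X, an m × n binary feature matrix, and y, an m × 1 binary label vector):

% Maximum likelihood estimates for the naive Bayes parameters.
phi_y   = mean(y == 1);                          % phi_y = p(y = 1)
phi_j_0 = sum(X(y == 0, :), 1) / sum(y == 0);    % 1 x n vector of phi_{j|y=0}
phi_j_1 = sum(X(y == 1, :), 1) / sum(y == 1);    % 1 x n vector of phi_{j|y=1}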
(c) Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier — i.e., if p(y = 0|x) and p(y = 1|x) are the class probabilities returned by naive Bayes, show that there exists some θ ∈ R^{n+1} such that

    p(y = 1|x) ≥ p(y = 0|x)  if and only if  θ^T [1; x] ≥ 0.

(Assume θ_0 is an intercept term.)

Answer:

    p(y = 1|x) ≥ p(y = 0|x)
    ⟺ p(y = 1|x) / p(y = 0|x) ≥ 1
    ⟺ ( Π_{j=1}^n p(x_j|y = 1) ) p(y = 1) / ( ( Π_{j=1}^n p(x_j|y = 0) ) p(y = 0) ) ≥ 1
    ⟺ ( Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j} ) φ_y / ( ( Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j} ) (1 − φ_y) ) ≥ 1
    ⟺ Σ_{j=1}^n [ x_j log( φ_{j|y=1} / φ_{j|y=0} ) + (1 − x_j) log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) ] + log( φ_y / (1 − φ_y) ) ≥ 0
    ⟺ Σ_{j=1}^n x_j log( (φ_{j|y=1})(1 − φ_{j|y=0}) / ( (φ_{j|y=0})(1 − φ_{j|y=1}) ) )
       + Σ_{j=1}^n log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) + log( φ_y / (1 − φ_y) ) ≥ 0
    ⟺ θ^T [1; x] ≥ 0,

where

    θ_0 = Σ_{j=1}^n log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) + log( φ_y / (1 − φ_y) )
    θ_j = log( (φ_{j|y=1})(1 − φ_{j|y=0}) / ( (φ_{j|y=0})(1 − φ_{j|y=1}) ) ),  j = 1, . . . , n.

5. Exponential family and the geometric distribution

(a) Consider the geometric distribution parameterized by φ:

    p(y; φ) = (1 − φ)^{y−1} φ,  y = 1, 2, 3, . . . .

Show that the geometric distribution is in the exponential family, and give b(y), η, T(y), and a(η).

Answer:

    p(y; φ) = (1 − φ)^{y−1} φ
            = exp[ log (1 − φ)^{y−1} + log φ ]
            = exp[ (y − 1) log(1 − φ) + log φ ]
            = exp[ y log(1 − φ) − log( (1 − φ)/φ ) ].

Then

    b(y) = 1
    η = log(1 − φ)
    T(y) = y
    a(η) = log( (1 − φ)/φ ) = log( e^η / (1 − e^η) ),

where the last line follows because η = log(1 − φ) ⇒ e^η = 1 − φ ⇒ φ = 1 − e^η.

(b) Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by 1/φ.

Answer:

    g(η) = E[y; φ] = 1/φ = 1/(1 − e^η).

(c) For a training set {(x^{(i)}, y^{(i)}); i = 1, . . . , m}, let the log-likelihood of an example be log p(y^{(i)}|x^{(i)}; θ). By taking the derivative of the log-likelihood with respect to θ_j, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.

Answer: The log-likelihood of an example (x^{(i)}, y^{(i)}) is defined as ℓ_i(θ) = log p(y^{(i)}|x^{(i)}; θ). To derive the stochastic gradient ascent rule, use the results from the previous parts and the standard GLM assumption that η = θ^T x:

    ℓ_i(θ) = log exp( θ^T x^{(i)} · y^{(i)} − log( e^{θ^T x^{(i)}} / (1 − e^{θ^T x^{(i)}}) ) )
           = θ^T x^{(i)} · y^{(i)} + log( e^{−θ^T x^{(i)}} − 1 ),

so

    ∂ℓ_i(θ)/∂θ_j = x_j^{(i)} y^{(i)} + ( e^{−θ^T x^{(i)}} / (e^{−θ^T x^{(i)}} − 1) ) (−x_j^{(i)})
                 = x_j^{(i)} y^{(i)} − ( 1 / (1 − e^{θ^T x^{(i)}}) ) x_j^{(i)}
                 = ( y^{(i)} − 1/(1 − e^{θ^T x^{(i)}}) ) x_j^{(i)}.

Thus the stochastic gradient ascent update rule should be

    θ_j := θ_j + α ∂ℓ_i(θ)/∂θ_j,

which is

    θ_j := θ_j + α ( y^{(i)} − 1/(1 − e^{θ^T x^{(i)}}) ) x_j^{(i)}.
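A minimal sketch of this update rule in Octave (our own illustration; x_i is a single training input including an intercept term, y_i its geometric response, and alpha a step size — none of these are part of the provided code):

function theta = sga_step(theta, x_i, y_i, alpha)
% One stochastic gradient ascent step for the geometric-response GLM.
% Assumes theta'*x_i < 0, as required for eta = log(1 - phi) to be valid.
h = 1 / (1 - exp(theta' * x_i));          % canonical response E[y|x] = 1/(1 - e^eta)
theta = theta + alpha * (y_i - h) * x_i;  % the update derived above
end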
\ No newline at end of file diff --git a/Lectures/aimlcs229/ps2_solution.txt b/Lectures/aimlcs229/ps2_solution.txt new file mode 100644 index 0000000..ca3fe48 --- /dev/null +++ b/Lectures/aimlcs229/ps2_solution.txt @@ -0,0 +1,720 @@

CS 229, Public Course
Problem Set #2 Solutions: Kernels, SVMs, and Theory

1. Kernel ridge regression

In contrast to ordinary least squares, which has the cost function

    J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})²,

we can also add a term that penalizes large weights in θ. In ridge regression, our least squares cost is regularized by adding a term (λ/2)||θ||², where λ > 0 is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then

    J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})² + (λ/2)||θ||².

(a) Use the vector notation described in class to find a closed-form expression for the value of θ which minimizes the ridge regression cost function.

Answer: Using the design matrix notation, we can rewrite J(θ) as

    J(θ) = (1/2)(Xθ − y)^T (Xθ − y) + (λ/2)θ^T θ.

Then the gradient is

    ∇_θ J(θ) = X^T Xθ − X^T y + λθ.

Setting the gradient to 0 gives us

    0 = X^T Xθ − X^T y + λθ
    θ = (X^T X + λI)^{−1} X^T y.

(b) Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite-dimensional) space. Using a feature mapping φ, the ridge regression cost function becomes

    J(θ) = (1/2) Σ_{i=1}^m (θ^T φ(x^{(i)}) − y^{(i)})² + (λ/2)||θ||².

Making a prediction on a new input x_new would now be done by computing θ^T φ(x_new). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing φ(x_new). You may assume that the parameter vector θ can be expressed as a linear combination of the input feature vectors; i.e., θ = Σ_{i=1}^m α_i φ(x^{(i)}) for some set of parameters α_i.

[Hint: You may find the following identity useful:

    (λI + BA)^{−1} B = B(λI + AB)^{−1}.

If you want, you can try to prove this as well, though this is not required for the problem.]

Answer: Let Φ be the design matrix associated with the feature vectors φ(x^{(i)}). Then from part (a) and the identity in the hint,

    θ = (Φ^T Φ + λI)^{−1} Φ^T y
      = Φ^T (ΦΦ^T + λI)^{−1} y
      = Φ^T (K + λI)^{−1} y,

where K is the kernel matrix for the training set (since (ΦΦ^T)_{ij} = φ(x^{(i)})^T φ(x^{(j)}) = K_{ij}). To predict a new value y_new, we can compute

    y_new = θ^T φ(x_new)
          = y^T (K + λI)^{−1} Φ φ(x_new)
          = Σ_{i=1}^m α_i K(x^{(i)}, x_new),

where α = (K + λI)^{−1} y. All these terms can be efficiently computed using the kernel function.

To prove the identity from the hint, we left-multiply by (λI + BA) and right-multiply by (λI + AB) on both sides. That is,

    (λI + BA)^{−1} B = B(λI + AB)^{−1}
    B = (λI + BA) B (λI + AB)^{−1}
    B(λI + AB) = (λI + BA)B
    λB + BAB = λB + BAB.

This last line clearly holds, proving the identity.
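A compact sketch of this kernelized fit-and-predict procedure (our own illustration on toy random data; the Gaussian kernel here is just an example choice of K, and all variable names are ours):

% Kernel ridge regression: fit alpha on training data, predict a new point.
m = 40; n = 2; lambda = 0.1; tau = 1.0;          % toy sizes and parameters
X = randn(m, n); y = randn(m, 1);                % toy training data
x_new = randn(1, n);                             % query point
k = @(a, b) exp(-norm(a - b)^2 / (2 * tau^2));   % example kernel choice
K = zeros(m, m);
for i = 1:m, for j = 1:m, K(i, j) = k(X(i, :), X(j, :)); end; end
alpha = (K + lambda * eye(m)) \ y;               % alpha = (K + lambda*I)^{-1} y
y_new = sum(alpha .* arrayfun(@(i) k(X(i, :), x_new), (1:m)'));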
2. ℓ2 norm soft margin SVMs

In class, we saw that if our data is not linearly separable, then we need to modify our support vector machine algorithm by introducing an error margin that must be minimized. Specifically, the formulation we have looked at is known as the ℓ1 norm soft margin SVM. In this problem we will consider an alternative method, known as the ℓ2 norm soft margin SVM. This new algorithm is given by the following optimization problem (notice that the slack penalties are now squared):

    min_{w,b,ξ}  (1/2)||w||² + (C/2) Σ_{i=1}^m ξ_i²
    s.t.  y^{(i)}(w^T x^{(i)} + b) ≥ 1 − ξ_i,  i = 1, . . . , m.

(a) Notice that we have dropped the ξ_i ≥ 0 constraint in the ℓ2 problem. Show that these non-negativity constraints can be removed. That is, show that the optimal value of the objective will be the same whether or not these constraints are present.

Answer: Consider a potential solution to the above problem with some ξ_i < 0. Then the constraint y^{(i)}(w^T x^{(i)} + b) ≥ 1 − ξ_i would also be satisfied for ξ_i = 0, and the objective function would be lower, proving that this could not be an optimal solution.

(b) What is the Lagrangian of the ℓ2 soft margin SVM optimization problem?

Answer:

    L(w, b, ξ, α) = (1/2) w^T w + (C/2) Σ_{i=1}^m ξ_i² − Σ_{i=1}^m α_i [ y^{(i)}(w^T x^{(i)} + b) − 1 + ξ_i ],

where α_i ≥ 0 for i = 1, . . . , m.

(c) Minimize the Lagrangian with respect to w, b, and ξ by taking the following gradients: ∇_w L, ∂L/∂b, and ∇_ξ L, and then setting them equal to 0. Here, ξ = [ξ_1, ξ_2, . . . , ξ_m]^T.

Answer: Taking the gradient with respect to w, we get

    0 = ∇_w L = w − Σ_{i=1}^m α_i y^{(i)} x^{(i)},

which gives us

    w = Σ_{i=1}^m α_i y^{(i)} x^{(i)}.

Taking the derivative with respect to b, we get

    0 = ∂L/∂b = − Σ_{i=1}^m α_i y^{(i)},

giving us

    0 = Σ_{i=1}^m α_i y^{(i)}.

Finally, taking the gradient with respect to ξ, we have

    0 = ∇_ξ L = Cξ − α,

where α = [α_1, α_2, . . . , α_m]^T. Thus, for each i = 1, . . . , m, we get

    0 = Cξ_i − α_i  ⇒  Cξ_i = α_i.

(d) What is the dual of the ℓ2 soft margin SVM optimization problem?

Answer: The objective function for the dual is W(α) = min_{w,b,ξ} L(w, b, ξ, α). Substituting w = Σ_i α_i y^{(i)} x^{(i)} and ξ_i = α_i/C into the Lagrangian, and using Σ_i α_i y^{(i)} = 0 to drop the b term,

    W(α) = (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + (1/(2C)) Σ_{i=1}^m α_i²
           − Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} + Σ_{i=1}^m α_i − (1/C) Σ_{i=1}^m α_i²
         = Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} − (1/2) Σ_{i=1}^m α_i²/C.

Then the dual formulation of our problem is

    max_α  Σ_{i=1}^m α_i − (1/2) Σ_{i=1}^m Σ_{j=1}^m α_i α_j y^{(i)} y^{(j)} (x^{(i)})^T x^{(j)} − (1/2) Σ_{i=1}^m α_i²/C
    s.t.  α_i ≥ 0, i = 1, . . . , m
          Σ_{i=1}^m α_i y^{(i)} = 0.
3. SVM with Gaussian kernel

Consider the task of training a support vector machine using the Gaussian kernel K(x, z) = exp(−||x − z||²/τ²). We will show that as long as there are no two identical points in the training set, we can always find a value for the bandwidth parameter τ such that the SVM achieves zero training error.

(a) Recall from class that the decision function learned by the support vector machine can be written as

    f(x) = Σ_{i=1}^m α_i y^{(i)} K(x^{(i)}, x) + b.

Assume that the training data {(x^{(1)}, y^{(1)}), . . . , (x^{(m)}, y^{(m)})} consists of points which are separated by at least a distance of ǫ; that is, ||x^{(j)} − x^{(i)}|| ≥ ǫ for any i ≠ j. Find values for the set of parameters {α_1, . . . , α_m, b} and Gaussian kernel width τ such that x^{(i)} is correctly classified, for all i = 1, . . . , m. [Hint: Let α_i = 1 for all i and b = 0. Now notice that for y ∈ {−1, +1} the prediction on x^{(i)} will be correct if |f(x^{(i)}) − y^{(i)}| < 1, so find a value of τ that satisfies this inequality for all i.]

Answer: First we set α_i = 1 for all i = 1, . . . , m and b = 0. Then, for a training example (x^{(i)}, y^{(i)}), we get

    |f(x^{(i)}) − y^{(i)}| = | Σ_{j=1}^m y^{(j)} K(x^{(j)}, x^{(i)}) − y^{(i)} |
        = | y^{(i)} + Σ_{j≠i} y^{(j)} exp( −||x^{(j)} − x^{(i)}||²/τ² ) − y^{(i)} |
        = | Σ_{j≠i} y^{(j)} exp( −||x^{(j)} − x^{(i)}||²/τ² ) |
        ≤ Σ_{j≠i} | y^{(j)} | · exp( −||x^{(j)} − x^{(i)}||²/τ² )
        = Σ_{j≠i} exp( −||x^{(j)} − x^{(i)}||²/τ² )
        ≤ Σ_{j≠i} exp( −ǫ²/τ² )
        = (m − 1) exp( −ǫ²/τ² ).

The first inequality comes from repeated application of the triangle inequality |a + b| ≤ |a| + |b|, and the second inequality from the assumption that ||x^{(j)} − x^{(i)}|| ≥ ǫ for all i ≠ j. Thus we need to choose a τ such that

    (m − 1) exp( −ǫ²/τ² ) < 1,

or

    τ < ǫ / sqrt( log(m − 1) ).

By choosing, for example, τ = ǫ / sqrt( log m ), we are done.

(b) Suppose we run an SVM with slack variables using the parameter τ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.

Answer: The classifier will obtain zero training error. The SVM without slack variables will always return zero training error if it is able to find a solution, so all that remains to be shown is that there exists at least one feasible point.

Consider the constraint y^{(i)}(w^T x^{(i)} + b) for some i, and let b = 0. Then

    y^{(i)}(w^T x^{(i)} + b) = y^{(i)} · f(x^{(i)}) > 0,

since f(x^{(i)}) and y^{(i)} have the same sign, as shown above. Therefore, if we choose all the α_i's large enough, y^{(i)}(w^T x^{(i)} + b) > 1, so the optimization problem is feasible.

(c) Suppose we run the SMO algorithm to train an SVM with slack variables, under the conditions stated above, using the value of τ you picked in the previous part, and using some arbitrary value of C (which you do not know beforehand). Will this necessarily result in a classifier that achieves zero training error? Why or why not? Again, a short explanation is sufficient.

Answer: The resulting classifier will not necessarily obtain zero training error. The C parameter controls the relative weights of the (C Σ_{i=1}^m ξ_i) and ((1/2)||w||²) terms of the SVM training objective. If the C parameter is sufficiently small, then the former component will have relatively little contribution to the objective. In this case, a weight vector which has a very small norm but does not achieve zero training error may achieve a lower objective value than one which achieves zero training error. For example, you can consider the extreme case where C = 0, and the objective is just the norm of w. In this case, w = 0 is the solution to the optimization problem regardless of the choice of τ, and this may not obtain zero training error.

4. Naive Bayes and SVMs for Spam Classification

In this question you'll look into the Naive Bayes and Support Vector Machine algorithms for a spam classification problem. However, instead of implementing the algorithms yourself, you'll use a freely available machine learning library. There are many such libraries available, with different strengths and weaknesses, but for this problem you'll use the WEKA machine learning package, available at http://www.cs.waikato.ac.nz/ml/weka/. WEKA implements many standard machine learning algorithms, is written in Java, and has both a GUI and a command line interface. It is not the best library for very large-scale data sets, but it is very nice for playing around with many different algorithms on medium size problems.

You can download and install WEKA by following the instructions given on the website above. To use it from the command line, you first need to install a java runtime environment, then add the weka.jar file to your CLASSPATH environment variable.
Finally, you can call WEKA using the command:

    java <classifier> -t <training file> -T <test file>

For example, to run the Naive Bayes classifier (using the multinomial event model) on our provided spam data set, run the command:

    java weka.classifiers.bayes.NaiveBayesMultinomial -t spam_train_1000.arff -T spam_test.arff

The spam classification dataset in the q4/ directory was provided courtesy of Christian Shelton (cshelton@cs.ucr.edu). Each example corresponds to a particular email, and each feature corresponds to a particular word. For privacy reasons we have removed the actual words themselves from the data set, and instead label the features generically as f1, f2, etc. However, the data set is from a real spam classification task, so the results demonstrate the performance of these algorithms on a real-world problem. The q4/ directory actually contains several different training files, named spam_train_50.arff, spam_train_100.arff, etc (the ".arff" format is the default format used by WEKA), each containing the corresponding number of training examples. There is also a single test set spam_test.arff, which is a hold-out set used for evaluating the classifier's performance.

(a) Run the weka.classifiers.bayes.NaiveBayesMultinomial classifier on the dataset and report the resulting error rates. Evaluate the performance of the classifier using each of the different training files (but each time using the same test file, spam_test.arff). Plot the error rate of the classifier versus the number of training examples.

(b) Repeat the previous part, but using the weka.classifiers.functions.SMO classifier, which implements the SMO algorithm to train an SVM. How does the performance of the SVM compare to that of Naive Bayes?

Answer: Using the above command line arguments to run the classifiers, we obtain the following error rates for the two algorithms:

[Figure: test error (%) versus number of training examples, from 0 to 2000, for the Support Vector Machine and Naive Bayes classifiers]

For small amounts of data, Naive Bayes performs better than the Support Vector Machine. However, as the amount of data grows, the SVM achieves a better error rate.

5. Uniform convergence

In class we proved that for any finite set of hypotheses H = {h_1, . . . , h_k}, if we pick the hypothesis ĥ that minimizes the training error on a set of m examples, then with probability at least (1 − δ),

    ε(ĥ) ≤ ( min_i ε(h_i) ) + 2 sqrt( (1/(2m)) log(2k/δ) ),

where ε(h_i) is the generalization error of hypothesis h_i. Now consider a special case (often called the realizable case) where we know, a priori, that there is some hypothesis in our class H that achieves zero error on the distribution from which the data is drawn. Then we could obviously just use the above bound with min_i ε(h_i) = 0; however, we can prove a better bound than this.

(a) Consider a learning algorithm which, after looking at m training examples, chooses some hypothesis ĥ ∈ H that makes zero mistakes on this training data. (By our assumption, there is at least one such hypothesis, possibly more.) Show that with probability 1 − δ,

    ε(ĥ) ≤ (1/m) log(k/δ).

Notice that since we do not have a square root here, this bound is much tighter. [Hint: Consider the probability that a hypothesis with generalization error greater than γ makes no mistakes on the training data. Instead of the Hoeffding bound, you might also find the following inequality useful: (1 − γ)^m ≤ e^{−γm}.]
Answer: Let h ∈ H be a hypothesis with true error greater than γ. Then

    P("h predicts correctly") ≤ 1 − γ,

so

    P("h predicts correctly m times") ≤ (1 − γ)^m ≤ e^{−γm}.

Applying the union bound,

    P(∃h ∈ H, s.t. ε(h) > γ and "h predicts correctly m times") ≤ k e^{−γm}.

We want to make this probability equal to δ, so we set

    k e^{−γm} = δ,

which gives us

    γ = (1/m) log(k/δ).

This implies that with probability 1 − δ,

    ε(ĥ) ≤ (1/m) log(k/δ).

(b) Rewrite the above bound as a sample complexity bound, i.e., in the form: for fixed δ and γ, for ε(ĥ) ≤ γ to hold with probability at least (1 − δ), it suffices that m ≥ f(k, γ, δ) (i.e., f(·) is some function of k, γ, and δ).

Answer: From part (a), if we take the equation

    k e^{−γm} = δ

and solve for m, we obtain

    m = (1/γ) log(k/δ).

Therefore, for m larger than this, ε(ĥ) ≤ γ will hold with probability at least 1 − δ.

\ No newline at end of file diff --git a/Lectures/aimlcs229/ps3_solution.txt b/Lectures/aimlcs229/ps3_solution.txt new file mode 100644 index 0000000..c12207c --- /dev/null +++ b/Lectures/aimlcs229/ps3_solution.txt @@ -0,0 +1,1155 @@

CS 229, Public Course
Problem Set #3 Solutions: Learning Theory and Unsupervised Learning

1. Uniform convergence and Model Selection

In this problem, we will prove a bound on the error of a simple model selection procedure.

Let there be a binary classification problem with labels y ∈ {0, 1}, and let H_1 ⊆ H_2 ⊆ . . . ⊆ H_k be k different finite hypothesis classes (|H_i| < ∞). Given a dataset S of m iid training examples, we will divide it into a training set S_train consisting of the first (1 − β)m examples, and a hold-out cross validation set S_cv consisting of the remaining βm examples. Here, β ∈ (0, 1).

Let ĥ_i = arg min_{h∈H_i} ε̂_{S_train}(h) be the hypothesis in H_i with the lowest training error (on S_train). Thus, ĥ_i would be the hypothesis returned by training (with empirical risk minimization) using hypothesis class H_i and dataset S_train. Also let h⋆_i = arg min_{h∈H_i} ε(h) be the hypothesis in H_i with the lowest generalization error.

Suppose that our algorithm first finds all the ĥ_i's using empirical risk minimization, then uses the hold-out cross validation set to select from the set {ĥ_1, . . . , ĥ_k} the hypothesis with minimum error on the hold-out set. That is, the algorithm will output

    ĥ = arg min_{h∈{ĥ_1,...,ĥ_k}} ε̂_{S_cv}(h).

For this question you will prove the following bound. Let any δ > 0 be fixed. Then with probability at least 1 − δ, we have that

    ε(ĥ) ≤ min_{i=1,...,k} [ ε(h⋆_i) + sqrt( (2/((1 − β)m)) log(4|H_i|/δ) ) ] + sqrt( (2/(βm)) log(4k/δ) ).

(a) Prove that with probability at least 1 − δ/2, for all ĥ_i,

    |ε(ĥ_i) − ε̂_{S_cv}(ĥ_i)| ≤ sqrt( (1/(2βm)) log(4k/δ) ).

Answer: For each ĥ_i, the empirical error on the cross-validation set, ε̂_{S_cv}(ĥ_i), represents the average of βm random variables with mean ε(ĥ_i), so by the Hoeffding inequality, for any ĥ_i,

    P( |ε(ĥ_i) − ε̂_{S_cv}(ĥ_i)| ≥ γ ) ≤ 2 exp(−2γ²βm).

As in the class notes, to ensure that this holds for all ĥ_i, we need to take the union over all k of the ĥ_i's:

    P( ∃i, s.t. |ε(ĥ_i) − ε̂_{S_cv}(ĥ_i)| ≥ γ ) ≤ 2k exp(−2γ²βm).

Setting this term equal to δ/2 and solving for γ yields

    γ = sqrt( (1/(2βm)) log(4k/δ) ),

proving the desired bound.

(b) Use part (a) to show that with probability 1 − δ/2,

    ε(ĥ) ≤ min_{i=1,...,k} ε(ĥ_i) + sqrt( (2/(βm)) log(4k/δ) ).
Answer: Let j = arg min_i ε(ĥ_i). Using part (a), with probability at least 1 − δ/2,

    ε(ĥ) ≤ ε̂_{S_cv}(ĥ) + sqrt( (1/(2βm)) log(4k/δ) )
         = min_{i=1,...,k} ε̂_{S_cv}(ĥ_i) + sqrt( (1/(2βm)) log(4k/δ) )
         ≤ ε̂_{S_cv}(ĥ_j) + sqrt( (1/(2βm)) log(4k/δ) )
         ≤ ε(ĥ_j) + 2 sqrt( (1/(2βm)) log(4k/δ) )
         = min_{i=1,...,k} ε(ĥ_i) + sqrt( (2/(βm)) log(4k/δ) ).

(c) Let j = arg min_i ε(ĥ_i). We know from class that for H_j, with probability 1 − δ/2,

    |ε(h_j) − ε̂_{S_train}(h_j)| ≤ sqrt( (1/(2(1 − β)m)) log(4|H_j|/δ) ),  ∀h_j ∈ H_j.

Use this to prove the final bound given at the beginning of this problem.

Answer: Since ĥ_j minimizes the training error on S_train over H_j, the uniform convergence bound above implies ε(ĥ_j) ≤ ε(h⋆_j) + 2 sqrt( (1/(2(1 − β)m)) log(4|H_j|/δ) ). The bounds in parts (a) and (c) both hold simultaneously with probability (1 − δ/2)² = 1 − δ + δ²/4 > 1 − δ, so with probability greater than 1 − δ,

    ε(ĥ) ≤ ε(h⋆_j) + 2 sqrt( (1/(2(1 − β)m)) log(4|H_j|/δ) ) + 2 sqrt( (1/(2βm)) log(4k/δ) ),

which is equivalent to the bound we want to show.

2. VC Dimension

Let the input domain of a learning problem be X = R. Give the VC dimension for each of the following classes of hypotheses. In each case, if you claim that the VC dimension is d, then you need to show that the hypothesis class can shatter d points, and explain why there are no d + 1 points it can shatter.

• h(x) = 1{a < x}, with parameter a ∈ R.

Answer: VC-dimension = 1.
(a) It can shatter the point {0}, by choosing a to be 2 and −2.
(b) It cannot shatter any two points {x_1, x_2}, x_1 < x_2, because the labelling x_1 = 1, x_2 = 0 cannot be realized.

• h(x) = 1{a < x < b}, with parameters a, b ∈ R.

Answer: VC-dimension = 2.
(a) It can shatter the points {0, 2} by choosing (a, b) to be (3, 5), (−1, 1), (1, 3), (−1, 3).
(b) It cannot shatter any three points {x_1, x_2, x_3}, x_1 < x_2 < x_3, because the labelling x_1 = x_3 = 1, x_2 = 0 cannot be realized.

• h(x) = 1{a sin x > 0}, with parameter a ∈ R.

Answer: VC-dimension = 1. Here a controls the amplitude (and sign) of the sine curve.
(a) It can shatter the point {π/2} by choosing a to be 1 and −1.
(b) It cannot shatter any two points {x_1, x_2}, since the labellings of x_1 and x_2 will flip together as a changes sign. If the labelling x_1 = x_2 = 1 can be realized for some a, then we cannot achieve x_1 ≠ x_2; if x_1 ≠ x_2 can be realized for some a, then we cannot achieve x_1 = x_2 = 1 (x_1 = x_2 = 0 can always be achieved by setting a = 0).

• h(x) = 1{sin(x + a) > 0}, with parameter a ∈ R.

Answer: VC-dimension = 2. Here a controls the phase of the sine curve.
(a) It can shatter the points {π/4, 3π/4}, by choosing a to be 0, π/2, π, and 3π/2.
(b) It cannot shatter any three points {x_1, x_2, x_3}. Since sine has a period of 2π, let's define x'_i = x_i mod 2π. W.l.o.g., assume x'_1 < x'_2 < x'_3. If the labelling x_1 = x_2 = x_3 = 1 can be realized, then the labelling x_1 = x_3 = 1, x_2 = 0 will not be realizable. Notice the similarity to the second question.

3. ℓ1 regularization for least squares

In the previous problem set, we looked at the least squares problem where the objective function is augmented with an additional regularization term λ||θ||₂². In this problem we'll consider a similar regularized objective but this time with a penalty on the ℓ1 norm of the parameters, λ||θ||₁, where ||θ||₁ is defined as Σ_i |θ_i|. That is, we want to minimize the objective

    J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})² + λ Σ_{i=1}^n |θ_i|.

There has been a great deal of recent interest in ℓ1 regularization, which, as we will see, has the benefit of outputting sparse solutions (i.e., many components of the resulting θ are equal to zero).
The ℓ1 regularized least squares problem is more difficult than the unregularized or ℓ2 regularized case, because the ℓ1 term is not differentiable. However, there have been many efficient algorithms developed for this problem that work very well in practice. One very straightforward approach, which we have already seen in class, is the coordinate descent method. In this problem you'll derive and implement a coordinate descent algorithm for ℓ1 regularized least squares, and apply it to test data.

(a) Here we'll derive the coordinate descent update for a given θ_i. Given the X and y matrices, as defined in the class notes, as well as a parameter vector θ, how can we adjust θ_i so as to minimize the optimization objective? To answer this question, we'll rewrite the optimization objective above as

    J(θ) = (1/2)||Xθ − y||₂² + λ||θ||₁ = (1/2)||Xθ̄ + X_i θ_i − y||₂² + λ||θ̄||₁ + λ|θ_i|,

where X_i ∈ R^m denotes the i-th column of X, and θ̄ is equal to θ except with θ̄_i = 0; all we have done in rewriting the above expression is to make the θ_i term explicit in the objective. However, this still contains the |θ_i| term, which is non-differentiable and therefore difficult to optimize. To get around this we make the observation that the sign of θ_i must either be non-negative or non-positive. But if we knew the sign of θ_i, then |θ_i| becomes just a linear term. That is, we can rewrite the objective as

    J(θ) = (1/2)||Xθ̄ + X_i θ_i − y||₂² + λ||θ̄||₁ + λ s_i θ_i,

where s_i denotes the sign of θ_i, s_i ∈ {−1, 1}. In order to update θ_i, we can just compute the optimal θ_i for both possible values of s_i (making sure that we restrict the optimal θ_i to obey the sign restriction we used to solve for it), then look to see which achieves the best objective value.

For each of the possible values of s_i, compute the resulting optimal value of θ_i. [Hint: to do this, you can fix s_i in the above equation, then differentiate with respect to θ_i to find the best value. Finally, clip θ_i so that it lies in the allowable range — i.e., for s_i = 1, you need to clip θ_i such that θ_i ≥ 0.]

Answer: For s_i = 1,

    J(θ) = (1/2) tr( (Xθ̄ + X_i θ_i − y)^T (Xθ̄ + X_i θ_i − y) ) + λ||θ̄||₁ + λθ_i
         = (1/2)( X_i^T X_i θ_i² + 2 X_i^T (Xθ̄ − y) θ_i + ||Xθ̄ − y||₂² ) + λ||θ̄||₁ + λθ_i,

so

    ∂J(θ)/∂θ_i = X_i^T X_i θ_i + X_i^T (Xθ̄ − y) + λ,

which means the optimal θ_i is given by

    θ_i = max( ( −X_i^T (Xθ̄ − y) − λ ) / ( X_i^T X_i ), 0 ).

Similarly, for s_i = −1, the optimal θ_i is given by

    θ_i = min( ( −X_i^T (Xθ̄ − y) + λ ) / ( X_i^T X_i ), 0 ).

(b) Implement the above coordinate descent algorithm using the updates you found in the previous part. We have provided a skeleton theta = l1ls(X,y,lambda) function in the q3/ directory. To implement the coordinate descent algorithm, you should repeatedly iterate over all the θ_i's, adjusting each as you found above. You can terminate the process when θ changes by less than 10^{-5} after all n of the updates.
Answer: The following is our implementation of l1ls.m:

function theta = l1ls(X,y,lambda)

m = size(X,1);
n = size(X,2);
theta = zeros(n,1);
old_theta = ones(n,1);

while (norm(theta - old_theta) > 1e-5)
  old_theta = theta;
  for i=1:n,
    % compute possible values for theta(i), one for each sign
    theta(i) = 0;
    theta_i(1) = max((-X(:,i)'*(X*theta - y) - lambda) / (X(:,i)'*X(:,i)), 0);
    theta_i(2) = min((-X(:,i)'*(X*theta - y) + lambda) / (X(:,i)'*X(:,i)), 0);

    % get objective value for both possible thetas
    theta(i) = theta_i(1);
    obj_theta(1) = 0.5*norm(X*theta - y)^2 + lambda*norm(theta,1);
    theta(i) = theta_i(2);
    obj_theta(2) = 0.5*norm(X*theta - y)^2 + lambda*norm(theta,1);

    % pick the theta which minimizes the objective
    [min_obj, min_ind] = min(obj_theta);
    theta(i) = theta_i(min_ind);
  end
end

(c) Test your implementation on the data provided in the q3/ directory. The [X, y, theta_true] = load_data; function will load all the data — the data was generated by y = X*theta_true + 0.05*randn(20,1), but theta_true is sparse, so that very few of the columns of X actually contain relevant features. Run your l1ls.m implementation on this data set, ranging λ from 0.001 to 10. Comment briefly on how this algorithm might be used for feature selection.

Answer: For λ = 1, our implementation of ℓ1 regularized least squares recovers the exact sparsity pattern of the true parameter that generated the data. In contrast, using any amount of ℓ2 regularization still leads to θ's that contain no zeros. This suggests that ℓ1 regularization could be very useful as a feature selection algorithm: we could run ℓ1 regularized least squares to see which coefficients are non-zero, then select only these features for use with either least squares or possibly a completely different machine learning algorithm.
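For instance, the sparsity pattern across regularization strengths can be inspected with a short driver like the following (our own sketch; it assumes the provided load_data helper and your l1ls implementation):

% Sweep lambda and report how many coefficients remain (near) non-zero.
[X, y, theta_true] = load_data;
for lambda = [0.001 0.01 0.1 1 10],
  theta = l1ls(X, y, lambda);
  fprintf('lambda = %6.3f: %d nonzero coefficients\n', ...
          lambda, sum(abs(theta) > 1e-6));
end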
4. K-Means Clustering

In this problem you'll implement the K-means clustering algorithm on a synthetic data set. There is code and data for this problem in the q4/ directory. Run load 'X.dat'; to load the data file for clustering. Implement the [clusters, centers] = k_means(X, k) function in this directory. As input, this function takes the m × n data matrix X and the number of clusters k. It should output an m-element vector, clusters, which indicates which of the clusters each data point belongs to, and a k × n matrix, centers, which contains the centroids of each cluster. Run the algorithm on the data provided, with k = 3 and k = 4. Plot the cluster assignments and centroids for each iteration of the algorithm using the draw_clusters(X, clusters, centroids) function. For each k, be sure to run the algorithm several times using different initial centroids.

Answer: The following is our implementation of k_means.m:

function [clusters, centroids] = k_means(X, k)

m = size(X,1);
n = size(X,2);

oldcentroids = zeros(k,n);
centroids = X(ceil(rand(k,1)*m),:);    % initialize centroids to k random examples

while (norm(oldcentroids - centroids) > 1e-15)
  oldcentroids = centroids;

  % compute cluster assignments
  for i=1:m,
    dists = sum((repmat(X(i,:), k, 1) - centroids).^2, 2);
    [min_dist, clusters(i,1)] = min(dists);
  end

  draw_clusters(X, clusters, centroids);
  pause(0.1);

  % compute cluster centroids
  for i=1:k,
    centroids(i,:) = mean(X(clusters == i, :));
  end
end

Below we show the centroid evolution for two typical runs with k = 3. Note that the different starting positions of the clusters lead to different final clusterings.

[Figure: two rows of six panels each, showing the cluster assignments and centroid positions over six iterations of k-means for two runs with different random initializations]

5. The Generalized EM algorithm

When attempting to run the EM algorithm, it may sometimes be difficult to perform the M step exactly — recall that we often need to implement numerical optimization to perform the maximization, which can be costly. Therefore, instead of finding the global maximum of our lower bound on the log-likelihood, an alternative is to just increase this lower bound a little bit, by taking one step of gradient ascent, for example. This is commonly known as the Generalized EM (GEM) algorithm.

Put slightly more formally, recall that the M-step of the standard EM algorithm performs the maximization

    θ := arg max_θ Σ_i Σ_{z^{(i)}} Q_i(z^{(i)}) log( p(x^{(i)}, z^{(i)}; θ) / Q_i(z^{(i)}) ).

The GEM algorithm, in contrast, performs the following update in the M-step:

    θ := θ + α ∇_θ Σ_i Σ_{z^{(i)}} Q_i(z^{(i)}) log( p(x^{(i)}, z^{(i)}; θ) / Q_i(z^{(i)}) ),

where α is a learning rate which we assume is chosen small enough such that we do not decrease the objective function when taking this gradient step.

(a) Prove that the GEM algorithm described above converges. To do this, you should show that the likelihood is monotonically improving, as it does for the EM algorithm — i.e., show that ℓ(θ^{(t+1)}) ≥ ℓ(θ^{(t)}).

Answer: We use the same logic as for the standard EM algorithm. Specifically, just as for EM, we have for the GEM algorithm that

    ℓ(θ^{(t+1)}) ≥ Σ_i Σ_{z^{(i)}} Q_i^{(t)}(z^{(i)}) log( p(x^{(i)}, z^{(i)}; θ^{(t+1)}) / Q_i^{(t)}(z^{(i)}) )
                ≥ Σ_i Σ_{z^{(i)}} Q_i^{(t)}(z^{(i)}) log( p(x^{(i)}, z^{(i)}; θ^{(t)}) / Q_i^{(t)}(z^{(i)}) )
                = ℓ(θ^{(t)}),

where, as in EM, the first line holds due to Jensen's inequality, and the last line holds because we choose the Q distribution to make this hold with equality.
The only difference between EM and GEM is the logic as to why the second line holds: for EM it held because θ^{(t+1)} was chosen to maximize this quantity, but for GEM it holds by our assumption that we take a gradient step small enough so as not to decrease the objective function.

(b) Instead of using the EM algorithm at all, suppose we just want to apply gradient ascent to maximize the log-likelihood directly. In other words, we are trying to maximize the (non-convex) function

    ℓ(θ) = Σ_i log Σ_{z^{(i)}} p(x^{(i)}, z^{(i)}; θ),

so we could simply use the update

    θ := θ + α ∇_θ Σ_i log Σ_{z^{(i)}} p(x^{(i)}, z^{(i)}; θ).

Show that this procedure in fact gives the same update as the GEM algorithm described above.

Answer: Differentiating the log likelihood directly we get

    (∂/∂θ_j) Σ_i log Σ_{z^{(i)}} p(x^{(i)}, z^{(i)}; θ)
        = Σ_i ( 1 / Σ_{z^{(i)}} p(x^{(i)}, z^{(i)}; θ) ) Σ_{z^{(i)}} (∂/∂θ_j) p(x^{(i)}, z^{(i)}; θ)
        = Σ_i Σ_{z^{(i)}} ( 1 / p(x^{(i)}; θ) ) · (∂/∂θ_j) p(x^{(i)}, z^{(i)}; θ).

For the GEM algorithm,

    (∂/∂θ_j) Σ_i Σ_{z^{(i)}} Q_i(z^{(i)}) log( p(x^{(i)}, z^{(i)}; θ) / Q_i(z^{(i)}) )
        = Σ_i Σ_{z^{(i)}} ( Q_i(z^{(i)}) / p(x^{(i)}, z^{(i)}; θ) ) · (∂/∂θ_j) p(x^{(i)}, z^{(i)}; θ).

But the E-step of the GEM algorithm chooses

    Q_i(z^{(i)}) = p(z^{(i)}|x^{(i)}; θ) = p(x^{(i)}, z^{(i)}; θ) / p(x^{(i)}; θ),

so

    Σ_i Σ_{z^{(i)}} ( Q_i(z^{(i)}) / p(x^{(i)}, z^{(i)}; θ) ) · (∂/∂θ_j) p(x^{(i)}, z^{(i)}; θ)
        = Σ_i Σ_{z^{(i)}} ( 1 / p(x^{(i)}; θ) ) · (∂/∂θ_j) p(x^{(i)}, z^{(i)}; θ),

which is the same as the derivative of the log likelihood.

\ No newline at end of file diff --git a/Lectures/aimlcs229/ps4_solution.txt b/Lectures/aimlcs229/ps4_solution.txt new file mode 100644 index 0000000..dbb46e5 --- /dev/null +++ b/Lectures/aimlcs229/ps4_solution.txt @@ -0,0 +1,1058 @@

CS 229, Public Course
Problem Set #4 Solutions: Unsupervised Learning and Reinforcement Learning

1. EM for supervised learning

In class we applied EM to the unsupervised learning setting. In particular, we represented p(x) by marginalizing over a latent random variable

    p(x) = Σ_z p(x, z) = Σ_z p(x|z)p(z).

However, EM can also be applied to the supervised learning setting, and in this problem we discuss a "mixture of linear regressors" model; this is an instance of what is often called the Hierarchical Mixture of Experts model. We want to represent p(y|x), x ∈ R^n and y ∈ R, and we do so by again introducing a discrete latent random variable

    p(y|x) = Σ_z p(y, z|x) = Σ_z p(y|x, z)p(z|x).

For simplicity we'll assume that z is binary valued, that p(y|x, z) is a Gaussian density, and that p(z|x) is given by a logistic regression model. More formally,

    p(z|x; φ) = g(φ^T x)^z (1 − g(φ^T x))^{1−z}
    p(y|x, z = i; θ_i) = (1/(√(2π) σ)) exp( −(y − θ_i^T x)² / (2σ²) ),  i = 0, 1,

where σ is a known parameter and φ, θ_0, θ_1 ∈ R^n are parameters of the model (here we use the subscript on θ to denote two different parameter vectors, not to index a particular entry in these vectors).

Intuitively, the process behind the model can be thought of as follows. Given a data point x, we first determine whether the data point belongs to one of two hidden classes z = 0 or z = 1, using a logistic regression model. We then determine y as a linear function of x (different linear functions for different values of z) plus Gaussian noise, as in the standard linear regression model. For example, the following data set could be well-represented by the model, but not by standard linear regression.

[Figure: a scatter plot of a data set whose points lie along two different lines]
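To make the generative story concrete, the following snippet (our own illustration, with arbitrarily chosen parameter values and variable names) samples a data set of exactly this form:

% Sample from the mixture-of-linear-regressors model (1-d input + intercept).
m = 200; sigma = 0.1;
phi = [2; -1]; theta0 = [0; 1]; theta1 = [1; -1];   % arbitrary parameters
X = [ones(m, 1), rand(m, 1)];                       % each row is x = [1, x_1]
z = rand(m, 1) < 1 ./ (1 + exp(-X * phi));          % z|x ~ Bernoulli(g(phi'x))
Theta = [theta0, theta1];
y = sum(X .* Theta(:, z + 1)', 2) + sigma * randn(m, 1);   % y = theta_z'x + noise
plot(X(z == 0, 2), y(z == 0), 'bo', X(z == 1, 2), y(z == 1), 'rx');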
(a) Suppose x, y, and z are all observed, so that we obtain a training set {(x^{(1)}, y^{(1)}, z^{(1)}), . . . , (x^{(m)}, y^{(m)}, z^{(m)})}. Write the log-likelihood of the parameters, and derive the maximum likelihood estimates for φ, θ_0, and θ_1. Note that because p(z|x) is a logistic regression model, there will not exist a closed form estimate of φ. In this case, derive the gradient and the Hessian of the likelihood with respect to φ; in practice, these quantities can be used to numerically compute the ML estimate.

Answer: The log-likelihood is given by

    ℓ(φ, θ_0, θ_1) = log Π_{i=1}^m p(y^{(i)}|x^{(i)}, z^{(i)}; θ_0, θ_1) p(z^{(i)}|x^{(i)}; φ)
        = Σ_{i:z^{(i)}=0} log( (1 − g(φ^T x^{(i)})) (1/(√(2π)σ)) exp( −(y^{(i)} − θ_0^T x^{(i)})²/(2σ²) ) )
        + Σ_{i:z^{(i)}=1} log( g(φ^T x^{(i)}) (1/(√(2π)σ)) exp( −(y^{(i)} − θ_1^T x^{(i)})²/(2σ²) ) ).

Differentiating with respect to θ_0 and setting the result to 0,

    0 = ∇_{θ_0} ℓ(φ, θ_0, θ_1) = ∇_{θ_0} Σ_{i:z^{(i)}=0} −(y^{(i)} − θ_0^T x^{(i)})².

But this is just a least-squares problem on a subset of the data. In particular, if we let X_0 and y_0 be the design matrices formed by considering only those examples with z^{(i)} = 0, then using the same logic as for the derivation of the least squares solution we get the maximum likelihood estimate of θ_0,

    θ_0 = (X_0^T X_0)^{−1} X_0^T y_0.

The derivation for θ_1 proceeds in the identical manner.

Differentiating with respect to φ, and ignoring terms that do not depend on φ,

    ∇_φ ℓ(φ, θ_0, θ_1) = ∇_φ [ Σ_{i:z^{(i)}=0} log(1 − g(φ^T x^{(i)})) + Σ_{i:z^{(i)}=1} log g(φ^T x^{(i)}) ]
                       = ∇_φ Σ_{i=1}^m [ (1 − z^{(i)}) log(1 − g(φ^T x^{(i)})) + z^{(i)} log g(φ^T x^{(i)}) ].

This is just the standard logistic regression objective function, for which we already know the gradient and Hessian:

    ∇_φ ℓ(φ, θ_0, θ_1) = X^T (z − h),  h_i = g(φ^T x^{(i)}),
    H = −X^T D X,  D_ii = g(φ^T x^{(i)}) (1 − g(φ^T x^{(i)})).

(b) Now suppose z is a latent (unobserved) random variable. Write the log-likelihood of the parameters, and derive an EM algorithm to maximize the log-likelihood. Clearly specify the E-step and M-step (again, the M-step will require a numerical solution, so find the appropriate gradients and Hessians).

Answer: The log likelihood is now

    ℓ(φ, θ_0, θ_1) = Σ_{i=1}^m log Σ_{z^{(i)}} p(y^{(i)}|x^{(i)}, z^{(i)}; θ_0, θ_1) p(z^{(i)}|x^{(i)}; φ)
        = Σ_{i=1}^m log [ (1 − g(φ^T x^{(i)})) (1/(√(2π)σ)) exp( −(y^{(i)} − θ_0^T x^{(i)})²/(2σ²) )
                          + g(φ^T x^{(i)}) (1/(√(2π)σ)) exp( −(y^{(i)} − θ_1^T x^{(i)})²/(2σ²) ) ].

In the E-step of the EM algorithm we compute

    Q_i(z^{(i)}) = p(z^{(i)}|x^{(i)}, y^{(i)}; φ, θ_0, θ_1)
                 = p(y^{(i)}|x^{(i)}, z^{(i)}; θ_0, θ_1) p(z^{(i)}|x^{(i)}; φ) / Σ_z p(y^{(i)}|x^{(i)}, z; θ_0, θ_1) p(z|x^{(i)}; φ).

Every probability in this term can be computed using the probability densities defined in the problem, so the E-step is tractable.

For the M-step, we first define w_j^{(i)} = p(z^{(i)} = j|x^{(i)}, y^{(i)}; φ, θ_0, θ_1) for j = 0, 1, as computed in the E-step (of course we only need to compute one of these terms in the real E-step, since w_0^{(i)} = 1 − w_1^{(i)}, but we define both to simplify the expressions).
Differentiating our lower bound on the likelihood with respect to θ_0, removing terms that don't depend on θ_0, and setting the expression equal to zero, we get

    0 = ∇_{θ_0} Σ_{i=1}^m Σ_{j=0,1} w_j^{(i)} log( p(y^{(i)}|x^{(i)}, z^{(i)} = j; θ_j) p(z^{(i)} = j|x^{(i)}; φ) / w_j^{(i)} )
      = ∇_{θ_0} Σ_{i=1}^m w_0^{(i)} log p(y^{(i)}|x^{(i)}, z^{(i)} = 0; θ_0)
      = ∇_{θ_0} Σ_{i=1}^m −w_0^{(i)} (y^{(i)} − θ_0^T x^{(i)})².

This is just a weighted least-squares problem, which has solution

    θ_0 = (X^T W X)^{−1} X^T W y,  W = diag(w_0^{(1)}, . . . , w_0^{(m)}).

The derivation for θ_1 proceeds similarly.

Finally, as before, we can't compute the M-step update for φ in closed form, so we instead find the gradient and Hessian. To do this we note that

    ∇_φ Σ_{i=1}^m Σ_{j=0,1} w_j^{(i)} log( p(y^{(i)}|x^{(i)}, z^{(i)} = j; θ_j) p(z^{(i)} = j|x^{(i)}; φ) / w_j^{(i)} )
        = ∇_φ Σ_{i=1}^m Σ_{j=0,1} w_j^{(i)} log p(z^{(i)} = j|x^{(i)}; φ)
        = ∇_φ Σ_{i=1}^m [ w_1^{(i)} log g(φ^T x^{(i)}) + (1 − w_1^{(i)}) log(1 − g(φ^T x^{(i)})) ].

This term is the same as the objective for the logistic regression task, but with the w_1^{(i)} quantity replacing y^{(i)}. Therefore, with w = (w_1^{(1)}, . . . , w_1^{(m)})^T, the gradient and Hessian are given by

    ∇_φ Σ_{i=1}^m Σ_{j=0,1} w_j^{(i)} log p(z^{(i)} = j|x^{(i)}; φ) = X^T (w − h),  h_i = g(φ^T x^{(i)}),
    H = −X^T D X,  D_ii = g(φ^T x^{(i)}) (1 − g(φ^T x^{(i)})).

2. Factor Analysis and PCA

In this problem we look at the relationship between two unsupervised learning algorithms we discussed in class: Factor Analysis and Principal Component Analysis.

Consider the following joint distribution over (x, z), where z ∈ R^k is a latent random variable:

    z ∼ N(0, I)
    x|z ∼ N(U z, σ² I),

where U ∈ R^{n×k} is a model parameter and σ² is assumed to be a known constant. This model is often called Probabilistic PCA. Note that this is nearly identical to the factor analysis model except we assume that the variance of x|z is a known scaled identity matrix rather than the diagonal parameter matrix, Φ, and we do not add an additional µ term to the mean (though this last difference is just for simplicity of presentation). However, as we will see, it turns out that as σ² → 0, this model is equivalent to PCA.

For simplicity, you can assume for the remainder of the problem that k = 1, i.e., that U is a column vector in R^n.

(a) Use the rules for manipulating Gaussian distributions to determine the joint distribution over (x, z) and the conditional distribution of z|x. [Hint: for later parts of this problem, it will help significantly if you simplify your solution for the conditional distribution using the identity we first mentioned in problem set #1: (λI + BA)^{−1} B = B(λI + AB)^{−1}.]

Answer: To compute the joint distribution, we compute the means and covariances of x and z. First, E[z] = 0 and

    E[x] = E[U z + ǫ] = U E[z] + E[ǫ] = 0  (where ǫ ∼ N(0, σ² I)).

Since both x and z have zero mean,

    Σ_zz = E[z z^T] = I (= 1, since z is a scalar when k = 1)
    Σ_zx = E[(U z + ǫ) z^T] = U E[z z^T] + E[ǫ z^T] = U
    Σ_xx = E[(U z + ǫ)(U z + ǫ)^T] = E[U z z^T U^T + ǫ z^T U^T + U z ǫ^T + ǫ ǫ^T]
         = U E[z z^T] U^T + E[ǫ ǫ^T] = U U^T + σ² I.

Therefore,

    [z; x] ∼ N( [0; 0], [1, U^T; U, U U^T + σ² I] ).
Using the rules for conditional Gaussian distributions, z|x is also Gaussian, with mean and covariance

    µ_{z|x} = U^T (U U^T + σ² I)^{−1} x = U^T x / (U^T U + σ²)
    Σ_{z|x} = 1 − U^T (U U^T + σ² I)^{−1} U = 1 − U^T U / (U^T U + σ²),

where in both cases the last equality comes from the identity in the hint.

(b) Using these distributions, derive an EM algorithm for the model. Clearly state the E-step and the M-step of the algorithm.

Answer: Even though z^{(i)} is a scalar value, in this problem we continue to use the notation z^{(i)T}, etc., to make the similarities to the Factor Analysis case obvious.

For the E-step, we compute the distribution Q_i(z^{(i)}) = p(z^{(i)}|x^{(i)}; U) by computing µ_{z^{(i)}|x^{(i)}} and Σ_{z^{(i)}|x^{(i)}} using the above formulas.

For the M-step, we need to maximize

    Σ_{i=1}^m E_{z^{(i)}∼Q_i} [ log p(x^{(i)}|z^{(i)}; U) + log p(z^{(i)}) − log Q_i(z^{(i)}) ].

Taking the gradient with respect to U, dropping terms that don't depend on U, and omitting the subscript on the expectation, this becomes

    ∇_U Σ_{i=1}^m E[ log p(x^{(i)}|z^{(i)}; U) ]
        = ∇_U Σ_{i=1}^m E[ −(1/(2σ²)) (x^{(i)} − U z^{(i)})^T (x^{(i)} − U z^{(i)}) ]
        = −(1/(2σ²)) Σ_{i=1}^m ∇_U E[ tr( z^{(i)T} U^T U z^{(i)} ) − 2 tr( z^{(i)T} U^T x^{(i)} ) ]
        = −(1/(2σ²)) Σ_{i=1}^m ( U E[z^{(i)} z^{(i)T}] − x^{(i)} E[z^{(i)T}] ),

using the same reasoning as in the Factor Analysis class notes. Setting this derivative to zero gives

    U = ( Σ_{i=1}^m x^{(i)} E[z^{(i)T}] ) ( Σ_{i=1}^m E[z^{(i)} z^{(i)T}] )^{−1}
      = ( Σ_{i=1}^m x^{(i)} µ_{z^{(i)}|x^{(i)}}^T ) ( Σ_{i=1}^m ( Σ_{z^{(i)}|x^{(i)}} + µ_{z^{(i)}|x^{(i)}} µ_{z^{(i)}|x^{(i)}}^T ) )^{−1}.

All these terms were calculated in the E-step, so this is our final M-step update.

(c) As σ² → 0, show that if the EM algorithm converges to a parameter vector U⋆ (and such convergence is guaranteed by the argument presented in class), then U⋆ must be an eigenvector of the sample covariance matrix Σ = (1/m) Σ_{i=1}^m x^{(i)} x^{(i)T} — i.e., U⋆ must satisfy

    λU⋆ = ΣU⋆.

[Hint: When σ² → 0, Σ_{z|x} → 0, so the E-step only needs to compute the means µ_{z|x} and not the variances. Let w ∈ R^m be a vector containing all these means, w_i = µ_{z^{(i)}|x^{(i)}}, and show that the E-step and M-step can be expressed as

    w = XU / (U^T U),  U = X^T w / (w^T w),

respectively. Finally, show that if U doesn't change after this update, it must satisfy the eigenvector equation shown above.]

Answer: For the E-step, when σ² → 0, µ_{z^{(i)}|x^{(i)}} = U^T x^{(i)} / (U^T U), so using w as defined in the hint we have

    w = XU / (U^T U),

as desired.

As mentioned in the hint, when σ² → 0, Σ_{z^{(i)}|x^{(i)}} = 0, so the M-step becomes

    U = ( Σ_{i=1}^m x^{(i)} µ_{z^{(i)}|x^{(i)}}^T ) ( Σ_{i=1}^m ( Σ_{z^{(i)}|x^{(i)}} + µ_{z^{(i)}|x^{(i)}} µ_{z^{(i)}|x^{(i)}}^T ) )^{−1}
      = ( Σ_{i=1}^m x^{(i)} w_i ) ( Σ_{i=1}^m w_i w_i )^{−1}
      = X^T w / (w^T w).

For U to remain unchanged after an update requires that

    U = X^T ( XU / (U^T U) ) / ( (XU / (U^T U))^T (XU / (U^T U)) )
      = X^T X U (U^T U) / (U^T X^T X U)
      = X^T X U · (1/λ),

where λ = U^T X^T X U / (U^T U), proving the desired equation.
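The σ² → 0 iteration in the hint is also easy to run directly; the following snippet (our own illustration on random data, not part of the problem set) confirms that it converges to a direction satisfying the eigenvector equation:

% Run the sigma^2 -> 0 EM iteration and check the eigenvector equation.
m = 500; n = 5;
X = randn(m, n) * diag([3 2 1 1 1]);    % data with one dominant direction
U = randn(n, 1);                        % random initialization
for iter = 1:100,
  w = X * U / (U' * U);                 % E-step: the means mu_{z|x}
  U = X' * w / (w' * w);                % M-step
end
Sigma = X' * X / m;                     % sample covariance (data ~zero-mean)
v = Sigma * U;
disp(norm(v / norm(v) - U / norm(U)));  % ~0: U satisfies lambda*U = Sigma*U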
This is one of the classical applications of the ICA algorithm, and it sparked a great
+deal of interest in the algorithm; it was observed that the bases recovered by ICA closely
+resemble image filters present in the first layer of the visual cortex.
+The q3/ directory contains the data and several useful pieces of code for this problem. The
+raw images are stored in the images/ subdirectory, though you will not need to work with
+these directly, since we provide code for loading and normalizing the images.
+Calling the function [X_ica, X_pca] = load_images; will load the images, break them
+into 16x16 image patches, and place all these patches into the columns of the matrices
+X_ica and X_pca. We create two different data sets for PCA and ICA because the
+algorithms require slightly different methods of preprocessing the data.(1)
+For this problem you'll implement the ica.m and pca.m functions, using the PCA and
+ICA algorithms described in the class notes. While the PCA implementation should be
+straightforward, getting a good implementation of ICA can be a bit trickier. Here is some
+general advice for getting a good implementation on this data set:
+- Picking a good learning rate is important. In our experiments we used alpha = 0.0005
+on this data set.
+- Batch gradient descent doesn't work well for ICA (this has to do with the fact that the
+ICA objective function is not concave), but the pure stochastic gradient described in
+the notes can be slow. (There are about 20,000 16x16 image patches in the data set,
+so one pass over the data using the stochastic gradient rule described in the notes
+requires inverting the 256x256 W matrix 20,000 times.) Instead, a good compromise
+is to use a hybrid stochastic/batch gradient descent where we calculate the gradient
+with respect to several examples at a time (100 worked well for us), and use this to
+update W. Our implementation makes 10 total passes over the entire data set.
+- It is a good idea to randomize the order of the examples presented to stochastic
+gradient descent before each pass over the data.
+- Vectorize your Matlab code as much as possible. For general examples of how to do
+this, look at the Matlab review session.
+(1) Recall that the first step of performing PCA is to subtract the mean and normalize the
+variance of the features. For the image data we're using, the preprocessing step for the ICA
+algorithm is slightly different, though the precise mechanism and justification is not important
+for the sake of this problem. Those who are curious about the details should read Bell and
+Sejnowski's paper "The 'Independent Components' of Natural Scenes are Edge Filters," which
+provided the basis for the implementation we use in this problem.
+For reference, computing the ICA W matrix for the entire set of image patches takes about
+5 minutes on a 1.6 GHz laptop using our implementation.
+After you've learned the U matrix for PCA (the columns of U should contain the principal
+components of the data) and the W matrix of ICA, you can plot the basis functions using
+the plot_ica_bases(W); and plot_pca_bases(U); functions we have provided. Comment
+briefly on the difference between the two sets of basis functions.
+Answer: The following are our implementations of pca.m and ica.m:
+
+function U = pca(X)
+% Columns of U are the principal components (eigenvectors of X*X').
+[U,S,V] = svd(X*X');
+
+function W = ica(X)
+[n,m] = size(X);
+chunk = 100;        % examples per mini-batch
+alpha = 0.0005;     % learning rate
+W = eye(n);         % unmixing matrix, initialized to the identity
+for iter=1:10,      % 10 passes over the data set
+  disp([num2str(iter)]);
+  X = X(:,randperm(m));               % reshuffle before each pass
+  for i=1:floor(m/chunk),
+    Xc = X(:,(i-1)*chunk+1:i*chunk);  % current mini-batch
+    % gradient of the ICA log-likelihood on the mini-batch
+    dW = (1 - 2./(1+exp(-W*Xc)))*Xc' + chunk*inv(W');
+    W = W + alpha*dW;                 % ascend the log-likelihood
+  end
+end
+
+PCA produces the following bases:
+[figure: PCA basis functions]
+while ICA produces the following bases:
+[figure: ICA basis functions]
+The PCA bases capture global features of the images, while the ICA bases capture more local
+features.
+4. Convergence of Policy Iteration
+In this problem we show that the Policy Iteration algorithm, described in the lecture notes,
+is guaranteed to find the optimal policy for an MDP. First, define B^\pi to be the Bellman
+operator for policy \pi, defined as follows: if V' = B^\pi(V), then
+
+ V'(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V(s').
+
+(a) Prove that if V_1(s) <= V_2(s) for all s \in S, then B^\pi(V_1)(s) <= B^\pi(V_2)(s) for all s \in S.
+Answer:
+
+ B^\pi(V_1)(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V_1(s')
+             <= R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V_2(s') = B^\pi(V_2)(s),
+
+where the inequality holds because P_{s\pi(s)}(s') >= 0.
+(b) Prove that for any V,
+
+ ||B^\pi(V) - V^\pi||_inf <= \gamma ||V - V^\pi||_inf,
+
+where ||V||_inf = max_{s \in S} |V(s)|. Intuitively, this means that applying the Bellman
+operator B^\pi to any value function V brings that value function "closer" to the value
+function for \pi, V^\pi. This also means that applying B^\pi repeatedly (an infinite number
+of times),
+
+ B^\pi(B^\pi(. . . B^\pi(V) . . .)),
+
+will result in the value function V^\pi (a little bit more is needed to make this completely
+formal, but we won't worry about that here).
+[Hint: Use the fact that for any \alpha, x \in R^n, if \sum_i \alpha_i = 1 and \alpha_i >= 0, then \sum_i \alpha_i x_i <= max_i x_i.]
+Answer:
+
+ ||B^\pi(V) - V^\pi||_inf
+   = max_{s \in S} | R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V(s') - R(s) - \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s') |
+   = \gamma max_{s \in S} | \sum_{s' \in S} P_{s\pi(s)}(s') (V(s') - V^\pi(s')) |
+   <= \gamma ||V - V^\pi||_inf,
+
+where the inequality follows from the hint above.
+(c) Now suppose that we have some policy \pi, and use Policy Iteration to choose a new
+policy \pi' according to
+
+ \pi'(s) = arg max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^\pi(s').
+
+Show that this policy will never perform worse than the previous one -- i.e., show
+that for all s \in S, V^\pi(s) <= V^{\pi'}(s).
+[Hint: First show that V^\pi(s) <= B^{\pi'}(V^\pi)(s), then use the preceding exercises to
+show that B^{\pi'}(V^\pi)(s) <= V^{\pi'}(s).]
+Answer:
+
+ V^\pi(s) = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s') V^\pi(s')
+        <= R(s) + \gamma max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^\pi(s')
+         = R(s) + \gamma \sum_{s' \in S} P_{s\pi'(s)}(s') V^\pi(s') = B^{\pi'}(V^\pi)(s).
+
+Applying part (a),
+
+ V^\pi(s) <= B^{\pi'}(V^\pi)(s)  =>  B^{\pi'}(V^\pi)(s) <= B^{\pi'}(B^{\pi'}(V^\pi))(s).
+
+Continually applying this property, and applying part (b), we obtain
+
+ V^\pi(s) <= B^{\pi'}(V^\pi)(s) <= B^{\pi'}(B^{\pi'}(V^\pi))(s) <= . . . <= B^{\pi'}(B^{\pi'}(. . . B^{\pi'}(V^\pi) . . .))(s) = V^{\pi'}(s).
+
+(d) Use the preceding exercises to show that policy iteration will eventually converge
+(i.e., produce a policy \pi' = \pi). Furthermore, show that it must converge to the
+optimal policy \pi*. For the latter part, you may use the property that if some value
+function satisfies
+
+ V(s) = R(s) + \gamma max_{a \in A} \sum_{s' \in S} P_{sa}(s') V(s'),
+
+then V = V*.
+Answer: We know that policy iteration must converge because there are only a finite
+number of possible policies (if there are |S| states, each with |A| actions, then that leads
+to |A|^{|S|} total possible policies). Since the policies are monotonically improving, as we
+showed in part (c), at some point we must stop generating new policies, so the algorithm
+must produce \pi' = \pi. Using the assumptions stated in the question, it is easy to show
+convergence to the optimal policy. If \pi' = \pi, then using the same logic as in part (c),
+
+ V^\pi(s) = V^{\pi'}(s) = R(s) + \gamma max_{a \in A} \sum_{s' \in S} P_{sa}(s') V^\pi(s'),
+
+so V^\pi = V* and therefore \pi = \pi*.
+5. Reinforcement Learning: The Mountain Car
+In this problem you will implement the Q-learning reinforcement learning algorithm
+described in class on a standard control domain known as the Mountain Car.(2) The Mountain
+Car domain simulates a car trying to drive up a hill, as shown in the figure below.
+[figure: the Mountain Car domain]
+(2) The dynamics of this domain were taken from Sutton and Barto, 1998.
+All states except those at the top of the hill have a constant reward R(s) = -1, while the
+goal state at the hilltop has reward R(s) = 0; thus an optimal agent will try to get to the
+top of the hill as fast as possible (when the car reaches the top of the hill, the episode is
+over, and the car is reset to its initial position). However, when starting at the bottom
+of the hill, the car does not have enough power to reach the top by driving forward, so
+it must first accelerate backwards, building up enough momentum to reach
+the top of the hill. This strategy of moving away from the goal in order to reach the goal
+makes the problem difficult for many classical control algorithms.
+As discussed in class, Q-learning maintains a table of Q-values, Q(s, a), for each state and
+action. These Q-values are useful because, in order to select an action in state s, we only
+need to check to see which Q-value is greatest. That is, in state s we take the action
+
+ arg max_{a \in A} Q(s, a).
+
+The Q-learning algorithm adjusts its estimates of the Q-values as follows. If an agent is in
+state s, takes action a, then ends up in state s', Q-learning will update Q(s, a) by
+
+ Q(s, a) := (1 - \alpha) Q(s, a) + \alpha (R(s') + \gamma max_{a' \in A} Q(s', a')).
+
+At each time step, your implementation of Q-learning can execute the greedy policy
+\pi(s) = arg max_{a \in A} Q(s, a).
+Implement the [q, steps_per_episode] = qlearning(episodes) function in the q5/
+directory. As input, the function takes the total number of episodes (each episode starts
+with the car at the bottom of the hill, and lasts until the car reaches the top), and outputs
+a matrix of the Q-values and a vector indicating how many steps it took before the car was
+able to reach the top of the hill. You should use the [x, s, absorb] = mountain_car(x,
+actions(a)) function to simulate one control cycle for the task -- the x variable describes
+the true (continuous) state of the system, whereas the s variable describes the discrete
+index of the state, which you'll use to build the Q-values.
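+(Aside, not part of the original handout: the snippet below sketches a single control cycle
+with this interface, assuming an existing Q-table q and the two-action setup used in the
+solution that follows.)
+
+actions = [-1, 1];                              % the two available actions
+[x, s, absorb] = mountain_car([0.0 -pi/6], 0);  % reset; observe discrete state s
+[maxq, a] = max(q(s, :));                       % greedy action w.r.t. the Q-table q
+[x, s2, absorb] = mountain_car(x, actions(a));  % simulate one control cycle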
+Plot a graph showing the average number of steps before the car reaches the top of the
+hill versus the episode number (there is quite a bit of variation in this quantity, so you will
+probably want to average these over a large number of episodes, as this will give you a
+better idea of how the number of steps before reaching the hilltop is decreasing). You can
+also visualize your resulting controller by calling the draw_mountain_car(q) function.
+Answer: The following is our implementation of qlearning.m:
+
+function [q, steps_per_episode] = qlearning(episodes)
+% set up parameters and initialize q values
+alpha = 0.05;
+gamma = 0.99;
+num_states = 100;
+num_actions = 2;
+actions = [-1, 1];
+q = zeros(num_states, num_actions);
+for i=1:episodes,
+  % start each episode with the car at the bottom of the hill
+  [x, s, absorb] = mountain_car([0.0 -pi/6], 0);
+  [maxq, a] = max(q(s,:));
+  if (q(s,1) == q(s,2)) a = ceil(rand*num_actions); end;  % break ties randomly
+  steps = 0;
+  while (~absorb)
+    % execute the best action or a random action
+    [x, sn, absorb] = mountain_car(x, actions(a));
+    reward = -double(absorb == 0);
+    % find the best action for the next state and update q value
+    [maxq, an] = max(q(sn,:));
+    if (q(sn,1) == q(sn,2)) an = ceil(rand*num_actions); end
+    q(s,a) = (1 - alpha)*q(s,a) + alpha*(reward + gamma*maxq);
+    a = an;
+    s = sn;
+    steps = steps + 1;
+  end
+  steps_per_episode(i) = steps;
+end
+
+Within 10000 episodes, the algorithm converges to a policy that usually gets the car up the hill
+in around 52-53 steps. The following plot shows the number of steps per episode (averaged
+over 500 episodes) versus the number of episodes. We generated the plot using the following
+code:
+
+for i=1:10,
+  [q, ep_steps] = qlearning(10000);
+  all_ep_steps(i,:) = ep_steps;
+end
+plot(mean(reshape(mean(all_ep_steps), 500, 20)));
+
+[figure: average steps per episode vs. episode number]
diff --git a/Lectures/aimlcs229/schedule.txt b/Lectures/aimlcs229/schedule.txt
new file mode 100644
index 0000000..6ee1c79
--- /dev/null
+++ b/Lectures/aimlcs229/schedule.txt
@@ -0,0 +1,55 @@
+CS 229
+Machine Learning
+Handout #2: Tentative Course Schedule
+Syllabus
+- Introduction (1 class) Basic concepts.
+- Supervised learning. (6 classes) Supervised learning setup. LMS.
+  Logistic regression. Perceptron. Exponential family.
+  Generative learning algorithms. Gaussian discriminant analysis. Naive Bayes.
+  Support vector machines.
+  Model selection and feature selection.
+  Ensemble methods: Bagging, boosting, ECOC.
+- Learning theory. (3 classes) Bias/variance tradeoff. Union and Chernoff/Hoeffding bounds.
+  VC dimension. Worst case (online) learning.
+  Advice on using learning algorithms.
+- Unsupervised learning. (5 classes) Clustering. K-means.
+  EM. Mixture of Gaussians.
+  Factor analysis.
+  PCA. MDS. pPCA.
+  Independent components analysis (ICA).
+- Reinforcement learning and control. (4 classes) MDPs. Bellman equations.
+  Value iteration. Policy iteration.
+  Linear quadratic regulation (LQR). LQG.
+  Q-learning. Value function approximation.
+  Policy search. Reinforce. POMDPs.
+Dates for assignments
+- Assignment 1: Out 10/3. Due 10/17.
+- Assignment 2: Out 10/17. Due 10/31.
+- Assignment 3: Out 10/31. Due 11/14.
+- Assignment 4: Out 11/14. Due 12/3.
+- Term project: Proposals due 10/19. Milestone due 11/16. Poster presentations on 12/12;
+  final writeup due on 12/14 (no late days).
+
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture1.txt b/Lectures/mlclass/Lecture1.txt
new file mode 100644
index 0000000..f765a9e
--- /dev/null
+++ b/Lectures/mlclass/Lecture1.txt
@@ -0,0 +1,263 @@
+Introduction
+Welcome
+Machine Learning
+
+SPAM
+
+Machine Learning
+- Grew out of work in AI
+- New capability for computers
+Examples:
+- Database mining
+  Large datasets from growth of automation/web.
+  E.g., web click data, medical records, biology, engineering
+- Applications can't program by hand.
+  E.g., autonomous helicopter, handwriting recognition, most of
+  Natural Language Processing (NLP), Computer Vision.
+- Self-customizing programs
+  E.g., Amazon, Netflix product recommendations
+- Understanding human learning (brain, real AI).
+
+Introduction
+What is machine learning
+Machine Learning
+
+Machine Learning definition
+- Arthur Samuel (1959). Machine Learning: Field of study that gives computers the
+  ability to learn without being explicitly programmed.
+- Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to
+  learn from experience E with respect to some task T and some performance measure P,
+  if its performance on T, as measured by P, improves with experience E.
+
+"A computer program is said to learn from experience E with respect to some task T and
+some performance measure P, if its performance on T, as measured by P, improves with
+experience E."
+Suppose your email program watches which emails you do or do not mark as spam, and
+based on that learns how to better filter spam. What is the task T in this setting?
+- Classifying emails as spam or not spam.
+- Watching you label emails as spam or not spam.
+- The number (or fraction) of emails correctly classified as spam/not spam.
+- None of the above; this is not a machine learning problem.
+
+Machine learning algorithms:
+- Supervised learning
+- Unsupervised learning
+Others: reinforcement learning, recommender systems.
+Also talk about: practical advice for applying learning algorithms.
+
+Introduction
+Supervised Learning
+Machine Learning
+
+Housing price prediction.
+[figure: price ($) in 1000's vs. size in feet^2]
+Supervised Learning: "right answers" given.
+Regression: predict continuous valued output (price).
+
+Breast cancer (malignant, benign)
+Classification: discrete valued output (0 or 1)
+[figure: malignant? (1 = Y, 0 = N) vs. tumor size]
+
+Other features: clump thickness, uniformity of cell size, uniformity of cell shape, ...
+[figure: classification with two features, age vs. tumor size]
+
+You're running a company, and you want to develop learning algorithms to address each
+of two problems.
+Problem 1: You have a large inventory of identical items. You want to predict how many
+of these items will sell over the next 3 months.
+Problem 2: You'd like software to examine individual customer accounts, and for each
+account decide if it has been hacked/compromised.
+Should you treat these as classification or as regression problems?
+- Treat both as classification problems.
+- Treat problem 1 as a classification problem, problem 2 as a regression problem.
+- Treat problem 1 as a regression problem, problem 2 as a classification problem.
+- Treat both as regression problems.
+
+Introduction
+Unsupervised Learning
+Machine Learning
+
+[figure: supervised learning, two labeled classes in the (x1, x2) plane]
+[figure: unsupervised learning, unlabeled points in the (x1, x2) plane]
+
+[figure: gene expression data, genes vs. individuals]
+[Source: Su-In Lee, Dana Pe'er, Aimee Dudley, George Church, Daphne Koller]
+
+Applications of clustering: organize computing clusters, social network analysis,
+market segmentation, astronomical data analysis.
+[Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)]
+
+Cocktail party problem
+Speaker #1, Speaker #2; Microphone #1, Microphone #2
+Microphone #1 / Microphone #2 recordings; Output #1 / Output #2
+[Audio clips courtesy of Te-Won Lee.]
+
+Cocktail party problem algorithm
+[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');
+[Source: Sam Roweis, Yair Weiss & Eero Simoncelli]
+
+Of the following examples, which would you address using an unsupervised learning
+algorithm? (Check all that apply.)
+- Given email labeled as spam/not spam, learn a spam filter.
+- Given a set of news articles found on the web, group them into sets of articles about
+  the same story.
+- Given a database of customer data, automatically discover market segments and group
+  customers into different market segments.
+- Given a dataset of patients diagnosed as either having diabetes or not, learn to classify
+  new patients as having diabetes or not.
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture10.txt b/Lectures/mlclass/Lecture10.txt
new file mode 100644
index 0000000..fbf947c
--- /dev/null
+++ b/Lectures/mlclass/Lecture10.txt
@@ -0,0 +1,417 @@
+Advice for applying machine learning
+Deciding what to try next
+Machine Learning
+
+Debugging a learning algorithm:
+Suppose you have implemented regularized linear regression to predict housing prices.
+However, when you test your hypothesis on a new set of houses, you find that it makes
+unacceptably large errors in its predictions. What should you try next?
+- Get more training examples
+- Try smaller sets of features
+- Try getting additional features
+- Try adding polynomial features
+- Try decreasing lambda
+- Try increasing lambda
+
+Machine learning diagnostic:
+Diagnostic: a test that you can run to gain insight into what is/isn't working with a
+learning algorithm, and gain guidance as to how best to improve its performance.
+Diagnostics can take time to implement, but doing so can be a very good use of your time.
+
+Advice for applying machine learning
+Evaluating a hypothesis
+Machine Learning
+
+Evaluating your hypothesis
+[figure: overfit hypothesis, price vs. size]
+Fails to generalize to new examples not in the training set.
+Features: size of house, no. of bedrooms, no. of floors, age of house, average income in
+neighborhood, kitchen size, ...
+
+Evaluating your hypothesis
+Dataset:
+Size  Price
+2104  400
+1600  330
+2400  369
+1416  232
+3000  540
+1985  300
+1534  315
+1427  199
+1380  212
+1494  243
+
+Training/testing procedure for linear regression
+- Learn parameter theta from training data (minimizing training error J_train(theta))
+- Compute test set error J_test(theta)
+
+Training/testing procedure for logistic regression
+- Learn parameter theta from training data
+- Compute test set error
+- Misclassification error (0/1 misclassification error)
+
+Advice for applying machine learning
+Model selection and training/validation/test sets
+Machine Learning
+
+Overfitting example
+[figure: high-degree polynomial fit, price vs. size]
+Once parameters were fit to some set of data (training set), the error of the parameters
+as measured on that data (the training error) is likely to be lower than the actual
+generalization error.
+
+Model selection
+Fit models of degree d = 1, 2, 3, ..., 10; choose one.
+How well does the model generalize? Report test set error J_test(theta).
+Problem: J_test(theta) is likely to be an optimistic estimate of the generalization error,
+i.e., our extra parameter (d = degree of polynomial) is fit to the test set.
+
+Evaluating your hypothesis
+Dataset: (same 10-example dataset as above)
+
+Train/validation/test error
+Training error; cross validation error; test error.
+
+Model selection
+Fit models of degree d = 1, ..., 10; pick the degree using the cross validation set;
+estimate the generalization error using the test set.
+
+Advice for applying machine learning
+Diagnosing bias vs. variance
+Machine Learning
+
+Bias/variance
+[figure: three fits on price vs. size: high bias (underfit), "just right", high variance (overfit)]
+
+Bias/variance
+[figure: training error and cross validation error vs. degree of polynomial d]
+
+Diagnosing bias vs. variance
+Suppose your learning algorithm is performing less well than you were hoping
+(J_cv(theta) or J_test(theta) is high). Is it a bias problem or a variance problem?
+[figure: training and cross validation error vs. degree of polynomial d]
+Bias (underfit): the training error and the cross validation error are both high.
+Variance (overfit): the training error is low, but the cross validation error is much higher.
+
+Advice for applying machine learning
+Regularization and bias/variance
+Machine Learning
+
+Linear regression with regularization
+Model:
+[figures: large lambda -> high bias (underfit); intermediate lambda -> "just right";
+small lambda -> high variance (overfit)]
+
+Choosing the regularization parameter lambda
+Model: try a range of values of lambda (e.g., 0, 0.01, 0.02, 0.04, ..., 10, roughly
+doubling each time, giving about 12 candidates); pick the one with the lowest cross
+validation error, then report its test error.
+
+Bias/variance as a function of the regularization parameter lambda
+[figure: training and cross validation error vs. lambda]
+
+Advice for applying machine learning
+Learning curves
+Machine Learning
+
+Learning curves
+[figure: training and cross validation error vs. training set size]
+
+High bias
+[figure: learning curves with high bias: both errors high and close together]
+If a learning algorithm is suffering from high bias, getting more training data will not
+(by itself) help much.
+
+High variance (and small lambda)
+[figure: learning curves with high variance: large gap between the two errors]
+If a learning algorithm is suffering from high variance, getting more training data is
+likely to help.
+
+Advice for applying machine learning
+Deciding what to try next (revisited)
+Machine Learning
+
+Debugging a learning algorithm:
+Suppose you have implemented regularized linear regression to predict housing prices.
+However, when you test your hypothesis on a new set of houses, you find that it makes
+unacceptably large errors in its predictions. What should you try next?
+- Get more training examples
+- Try smaller sets of features
+- Try getting additional features
+- Try adding polynomial features
+- Try decreasing lambda
+- Try increasing lambda
+
+Neural networks and overfitting
+"Small" neural network (fewer parameters; more prone to underfitting). Computationally cheaper.
+"Large" neural network (more parameters; more prone to overfitting). Computationally more
+expensive. Use regularization (lambda) to address overfitting.
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture11.txt b/Lectures/mlclass/Lecture11.txt
new file mode 100644
index 0000000..83302eb
--- /dev/null
+++ b/Lectures/mlclass/Lecture11.txt
@@ -0,0 +1,281 @@
+Machine learning system design
+Machine Learning
+Prioritizing what to work on: spam classification example
+
+Building a spam classifier
+From: cheapsales@buystufffromme.com
+To: ang@cs.stanford.edu
+Subject: Buy now!
+Deal of the week! Buy now!
+Rolex w4tchs - $100
+Med1cine (any kind) - $50
+Also low cost M0rgages available.
+
+From: Alfred Ng
+To: ang@cs.stanford.edu
+Subject: Christmas dates?
+Hey Andrew,
+Was talking to Mom about plans for Xmas. When do you get off work. Meet Dec 22?
+Alf
+
+Building a spam classifier
+Supervised learning. x = features of email. y = spam (1) or not spam (0).
+Features x: choose 100 words indicative of spam/not spam.
+Note: in practice, take the most frequently occurring n words (10,000 to 50,000) in the
+training set, rather than manually picking 100 words.
+
+Building a spam classifier
+How to spend your time to make it have low error?
+- Collect lots of data, e.g., the "honeypot" project.
+- Develop sophisticated features based on email routing information (from the email header).
+- Develop sophisticated features for the message body, e.g., should "discount" and
+  "discounts" be treated as the same word? How about "deal" and "Dealer"? Features
+  about punctuation?
+- Develop sophisticated algorithms to detect misspellings (e.g., m0rtgage, med1cine, w4tches).
+
+Machine learning system design
+Error analysis
+Machine Learning
+
+Recommended approach
+- Start with a simple algorithm that you can implement quickly. Implement it and test
+  it on your cross-validation data.
+- Plot learning curves to decide if more data, more features, etc. are likely to help.
+- Error analysis: manually examine the examples (in the cross validation set) that your
+  algorithm made errors on. See if you spot any systematic trend in what type of examples
+  it is making errors on.
+
+Error analysis
+m_cv = 500 examples in cross validation set; algorithm misclassifies 100 emails.
+Manually examine the 100 errors, and categorize them based on:
+(i) What type of email it is: pharma, replica/fake, steal passwords, other.
+(ii) What cues (features) you think would have helped the algorithm classify them
+correctly: deliberate misspellings (m0rgage, med1cine, etc.), unusual email routing,
+unusual (spamming) punctuation.
+
+The importance of numerical evaluation
+Should discount/discounts/discounted/discounting be treated as the same word?
+Can use "stemming" software (e.g., the Porter stemmer); it may conflate, e.g.,
+universe/university.
+Error analysis may not be helpful for deciding if this is likely to improve performance.
+The only solution is to try it and see if it works.
+Need numerical evaluation (e.g., cross validation error) of the algorithm's performance
+with and without stemming.
+Without stemming: ___  With stemming: ___
+Distinguish upper vs. lower case (Mom/mom): ___
+
+Machine learning system design
+Error metrics for skewed classes
+Machine Learning
+
+Cancer classification example
+Train logistic regression model h_theta(x) (y = 1 if cancer, y = 0 otherwise).
+Find that you got 1% error on the test set (99% correct diagnoses).
+Only 0.50% of patients have cancer.
+function y = predictCancer(x)
+y = 0; %ignore x!
+return
+
+Precision/Recall
+y = 1 in presence of a rare class that we want to detect.
+Precision: of all patients where we predicted y = 1, what fraction actually has cancer?
+ Precision = true positives / no. of predicted positives
+Recall: of all patients that actually have cancer, what fraction did we correctly detect
+as having cancer?
+ Recall = true positives / no. of actual positives
+
+Machine learning system design
+Trading off precision and recall
+Machine Learning
+
+Trading off precision and recall
+Logistic regression: predict 1 if h_theta(x) >= threshold; predict 0 otherwise.
+Suppose we want to predict y = 1 (cancer) only if very confident: raise the threshold
+(higher precision, lower recall).
+Suppose we want to avoid missing too many cases of cancer (avoid false negatives): lower
+the threshold (higher recall, lower precision).
+[figure: precision vs. recall curve]
+More generally: predict 1 if h_theta(x) >= threshold.
+
+F1 Score (F score)
+How to compare precision/recall numbers?
+             Precision (P)  Recall (R)  Average  F1 Score
+Algorithm 1  0.5            0.4         0.45     0.444
+Algorithm 2  0.7            0.1         0.4      0.175
+Algorithm 3  0.02           1.0         0.51     0.0392
+Average: (P + R) / 2
+F1 Score: 2PR / (P + R)
+
+Machine learning system design
+Data for machine learning
+Machine Learning
+
+Designing a high accuracy learning system
+E.g., classify between confusable words: {to, two, too}, {then, than}.
+For breakfast I ate _____ eggs.
+Algorithms:
+- Perceptron (logistic regression)
+- Winnow
+- Memory-based
+- Naive Bayes
+[figure: accuracy vs. training set size (millions) for the four algorithms]
+"It's not who has the best algorithm that wins. It's who has the most data."
+[Banko and Brill, 2001]
+
+Large data rationale
+Assume feature x has sufficient information to predict y accurately.
+Example: For breakfast I ate _____ eggs.
+Counterexample: predict housing price from only size (feet^2) and no other features.
+Useful test: given the input x, can a human expert confidently predict y?
+
+Large data rationale
+Use a learning algorithm with many parameters (e.g., logistic regression/linear regression
+with many features; a neural network with many hidden units).
+Use a very large training set (unlikely to overfit).
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture12.txt b/Lectures/mlclass/Lecture12.txt
new file mode 100644
index 0000000..a297877
--- /dev/null
+++ b/Lectures/mlclass/Lecture12.txt
@@ -0,0 +1,275 @@
+Support Vector Machines
+Optimization objective
+Machine Learning
+
+Alternative view of logistic regression
+If y = 1, we want h_theta(x) close to 1 (theta' x much greater than 0);
+if y = 0, we want h_theta(x) close to 0 (theta' x much less than 0).
+
+Alternative view of logistic regression
+Cost of example: replace the logistic log-cost terms with the piecewise-linear costs
+cost_1(theta' x) (for y = 1) and cost_0(theta' x) (for y = 0).
+
+Support vector machine
+Logistic regression objective vs. the support vector machine objective (use cost_1/cost_0,
+and weight the data term by the parameter C).
+
+SVM hypothesis
+Hypothesis: h_theta(x) = 1 if theta' x >= 0, and 0 otherwise.
+
+Support Vector Machines
+Large margin intuition
+Machine Learning
+
+Support Vector Machine
+[figure: cost_1(z) is zero for z >= 1; cost_0(z) is zero for z <= -1]
+If y = 1, we want theta' x >= 1 (not just >= 0).
+If y = 0, we want theta' x <= -1 (not just < 0).
+
+SVM Decision Boundary
+Whenever y(i) = 1: theta' x(i) >= 1. Whenever y(i) = 0: theta' x(i) <= -1.
+
+SVM Decision Boundary: linearly separable case
+[figure: two classes in the (x1, x2) plane; the SVM picks the separator with the largest margin]
+Large margin classifier.
+
+Large margin classifier in presence of outliers
+[figure: a single outlier can shift the boundary if C is very large]
+
+Support Vector Machines
+The mathematics behind large margin classification (optional)
+Machine Learning
+
+Vector inner product
+[worked on the board: u' v = p * ||u||, where p is the signed projection of v onto u]
+
+SVM Decision Boundary
+[worked on the board: theta' x(i) = p(i) * ||theta||, so minimizing ||theta|| prefers
+large projections p(i), i.e., a large margin]
+
+Support Vector Machines
+Kernels I
+Machine Learning
+
+Non-linear decision boundary
+[figure: non-linear boundary in the (x1, x2) plane]
+Is there a different / better choice of the features f1, f2, f3, ...?
+
+Kernel
+Given x, compute new features depending on proximity to landmarks l(1), l(2), l(3).
+
+Kernels and similarity
+f_i = similarity(x, l(i)) = exp(-||x - l(i)||^2 / (2 sigma^2));
+f_i is close to 1 when x is near l(i), and close to 0 when x is far from it.
+
+Example:
+[figure: contours of the Gaussian similarity around a landmark for different sigma^2]
+[figure: a non-linear boundary built from landmark features]
+
+Support Vector Machines
+Kernels II
+Machine Learning
+
+Choosing the landmarks
+Given x: predict y = 1 if theta0 + theta1 f1 + theta2 f2 + theta3 f3 >= 0.
+Where to get l(1), l(2), l(3), ...? Use the training examples themselves as landmarks.
+
+SVM with Kernels
+Given (x(1), y(1)), ..., (x(m), y(m)), choose l(1) = x(1), ..., l(m) = x(m).
+Given example x: compute f1 = similarity(x, l(1)), ..., fm = similarity(x, l(m)).
+For training example (x(i), y(i)): compute its feature vector f(i) the same way.
+
+SVM with Kernels
+Hypothesis: given x, compute features f; predict "y = 1" if theta' f >= 0.
+Training: minimize the SVM objective with f(i) in place of x(i).
+
+SVM parameters:
+C (plays the role of 1/lambda).
+ Large C: lower bias, high variance.
+ Small C: higher bias, low variance.
+sigma^2.
+ Large sigma^2: features f_i vary more smoothly; higher bias, lower variance.
+ Small sigma^2: features f_i vary less smoothly; lower bias, higher variance.
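+
+(Aside, not from the original slides: a minimal Octave sketch of the kernel-feature
+computation just described. kernelFeatures is a hypothetical helper name; x is one
+example, the columns of L are the landmarks, and sigma2 is the Gaussian kernel's
+bandwidth parameter discussed above.)
+
+function f = kernelFeatures(x, L, sigma2)
+% x: n x 1 example; L: n x m matrix whose columns are the landmarks.
+% Returns the m x 1 vector of Gaussian similarities f(i) = K(x, l(i)).
+m = size(L, 2);
+f = zeros(m, 1);
+for i = 1:m,
+  d = x - L(:, i);
+  f(i) = exp(-(d' * d) / (2 * sigma2));
+end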
+
+Support Vector Machines
+Using an SVM
+Machine Learning
+
+Use an SVM software package (e.g., liblinear, libsvm, ...) to solve for the parameters theta.
+Need to specify:
+- Choice of parameter C.
+- Choice of kernel (similarity function):
+  E.g., no kernel ("linear kernel"): predict "y = 1" if theta' x >= 0.
+  Gaussian kernel: f_i = exp(-||x - l(i)||^2 / (2 sigma^2)), where l(i) = x(i).
+  Need to choose sigma^2.
+
+Kernel (similarity) functions:
+function f = kernel(x1, x2)
+% Gaussian similarity between examples x1 and x2
+% (sigma2 is assumed to be set elsewhere, e.g., passed in or global)
+f = exp(-sum((x1 - x2).^2) / (2*sigma2));
+return
+
+Note: do perform feature scaling before using the Gaussian kernel.
+
+Other choices of kernel
+Note: not all similarity functions similarity(x, l) make valid kernels. (They need to
+satisfy a technical condition called "Mercer's Theorem" to make sure SVM packages'
+optimizations run correctly, and do not diverge.)
+Many off-the-shelf kernels available:
+- Polynomial kernel
+- More esoteric: string kernel, chi-square kernel, histogram intersection kernel, ...
+
+Multi-class classification
+Many SVM packages already have built-in multi-class classification functionality.
+Otherwise, use the one-vs.-all method: train K SVMs, one to distinguish y = i from the
+rest, for i = 1, ..., K; pick the class i with the largest (theta(i))' x.
+
+Logistic regression vs. SVMs
+n = number of features, m = number of training examples.
+If n is large (relative to m): use logistic regression, or SVM without a kernel ("linear kernel").
+If n is small, m is intermediate: use SVM with Gaussian kernel.
+If n is small, m is large: create/add more features, then use logistic regression or SVM
+without a kernel.
+A neural network is likely to work well for most of these settings, but may be slower to train.
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture13.txt b/Lectures/mlclass/Lecture13.txt
new file mode 100644
index 0000000..fc131f5
--- /dev/null
+++ b/Lectures/mlclass/Lecture13.txt
@@ -0,0 +1,265 @@
+Clustering
+Unsupervised learning introduction
+Machine Learning
+
+Supervised learning
+Training set: {(x(1), y(1)), ..., (x(m), y(m))}
+
+Unsupervised learning
+Training set: {x(1), x(2), ..., x(m)}
+
+Applications of clustering
+Market segmentation, social network analysis, organize computing clusters,
+astronomical data analysis.
+[Image credit: NASA/JPL-Caltech/E. Churchwell (Univ. of Wisconsin, Madison)]
+
+Clustering
+K-means algorithm
+Machine Learning
+
+[figures: K-means iterations, alternating cluster assignments and centroid moves]
+
+K-means algorithm
+Input:
+- K (number of clusters)
+- Training set {x(1), x(2), ..., x(m)}, x(i) in R^n (drop the x0 = 1 convention)
+
+K-means algorithm
+Randomly initialize K cluster centroids mu_1, ..., mu_K.
+Repeat {
+ for i = 1 to m
+  c(i) := index (from 1 to K) of cluster centroid closest to x(i)
+ for k = 1 to K
+  mu_k := average (mean) of points assigned to cluster k
+}
+
+K-means for non-separated clusters
+[figure: T-shirt sizing, weight vs. height]
+
+Clustering
+Optimization objective
+Machine Learning
+
+K-means optimization objective
+c(i) = index of cluster (1, 2, ..., K) to which example x(i) is currently assigned
+mu_k = cluster centroid k (mu_k in R^n)
+mu_{c(i)} = cluster centroid of the cluster to which example x(i) has been assigned
+Optimization objective:
+ J(c(1), ..., c(m), mu_1, ..., mu_K) = (1/m) sum_{i=1}^m ||x(i) - mu_{c(i)}||^2
+
+K-means algorithm (as above): the cluster assignment step minimizes J over the c(i);
+the centroid move step minimizes J over the mu_k.
+
+Clustering
+Random initialization
+Machine Learning
+
+(K-means algorithm as above.)
+
+Random initialization
+Should have K < m.
+Randomly pick K training examples; set mu_1, ..., mu_K equal to these K examples.
+
+Local optima
+[figure: different initializations can converge to different local optima of J]
+
+Random initialization
+For i = 1 to 100 {
+ Randomly initialize K-means.
+ Run K-means. Get c(1), ..., c(m), mu_1, ..., mu_K.
+ Compute the cost function (distortion) J.
+}
+Pick the clustering that gave the lowest cost J.
+
+Clustering
+Choosing the number of clusters
+Machine Learning
+
+What is the right value of K?
+[figure: the same dataset can plausibly be seen as having different numbers of clusters]
+
+Choosing the value of K
+Elbow method:
+[figures: cost function J vs. number of clusters K, with and without a clear "elbow"]
+
+Choosing the value of K
+Sometimes, you're running K-means to get clusters to use for some later/downstream
+purpose. Evaluate K-means based on a metric for how well it performs for that later
+purpose. E.g.:
+T-shirt sizing
+[figures: weight vs. height clustered with different numbers of sizes]
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture14.txt b/Lectures/mlclass/Lecture14.txt
new file mode 100644
index 0000000..8e2bb8e
--- /dev/null
+++ b/Lectures/mlclass/Lecture14.txt
@@ -0,0 +1,339 @@
+Dimensionality Reduction
+Motivation I: Data Compression
+Machine Learning
+
+Data Compression
+Reduce data from 2D to 1D
+[figure: lengths in inches vs. cm projected onto a line]
+
+Data Compression
+Reduce data from 3D to 2D
+[figure: 3D points projected onto a 2D plane]
+
+Dimensionality Reduction
+Motivation II: Data Visualization
+Machine Learning
+
+Data Visualization
+Country    GDP          Per capita GDP  Human        Life        Poverty index  Mean household
+           (trillions   (thousands of   Development  expectancy  (Gini as       income
+           of US$)      intl. $)        Index        (years)     percentage)    (thousands of US$)
+Canada     1.577        39.17           0.908        80.7        32.6           67.293
+China      5.878        7.54            0.687        73          46.9           10.22
+India      1.632        3.41            0.547        64.7        36.8           0.735
+Russia     1.48         19.84           0.755        65.5        39.9           0.72
+Singapore  0.223        56.69           0.866        80          42.5           67.1
+USA        14.527       46.86           0.91         78.3        40.8           84.3
+...
+[resources from en.wikipedia.org]
+
+Data Visualization
+Reduce the data to two dimensions (z1, z2):
+Country    z1   z2
+Canada     1.6  1.2
+China      1.7  0.3
+India      1.6  0.2
+Russia     1.4  0.5
+Singapore  0.5  1.7
+USA        2    1.5
+...
+[figure: countries plotted in the (z1, z2) plane]
+
+Dimensionality Reduction
+Principal Component Analysis problem formulation
+Machine Learning
+
+Principal Component Analysis (PCA) problem formulation
+[figure: 2D data projected onto a line; projection errors shown]
+Reduce from 2-dimensions to 1-dimension: find a direction (a vector u(1) in R^n) onto
+which to project the data so as to minimize the projection error.
+Reduce from n-dimensions to k-dimensions: find k vectors u(1), ..., u(k) onto which to
+project the data, so as to minimize the projection error.
+
+PCA is not linear regression
+[figures: linear regression minimizes vertical errors; PCA minimizes orthogonal
+projection errors]
+
+Dimensionality Reduction
+Principal Component Analysis algorithm
+Machine Learning
+
+Data preprocessing
+Training set: x(1), x(2), ..., x(m)
+Preprocessing (feature scaling/mean normalization):
+Replace each x_j(i) with x_j(i) - mu_j.
+If different features are on different scales (e.g., x1 = size of house, x2 = number of
+bedrooms), scale features to have a comparable range of values.
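+
+(Aside, not from the original slides: a small Octave sketch of this preprocessing step,
+assuming X is an m x n data matrix with one example per row.)
+
+mu = mean(X);                             % 1 x n vector of feature means
+X_norm = bsxfun(@minus, X, mu);           % mean normalization: x_j - mu_j
+sigma = std(X_norm);                      % per-feature standard deviations
+X_norm = bsxfun(@rdivide, X_norm, sigma); % scale features to comparable ranges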
+
+Principal Component Analysis (PCA) algorithm
+[figures: reduce data from 2D to 1D, and from 3D to 2D]
+
+Principal Component Analysis (PCA) algorithm
+Reduce data from n-dimensions to k-dimensions.
+Compute "covariance matrix":
+ Sigma = (1/m) sum_{i=1}^m x(i) x(i)'
+Compute "eigenvectors" of matrix Sigma:
+ [U,S,V] = svd(Sigma);
+
+Principal Component Analysis (PCA) algorithm
+From [U,S,V] = svd(Sigma), we get the n x n matrix U whose columns are the principal
+components; take its first k columns.
+
+Principal Component Analysis (PCA) algorithm summary
+After mean normalization (ensure every feature has zero mean) and optionally feature scaling:
+Sigma = (1/m) * X' * X;
+[U,S,V] = svd(Sigma);
+Ureduce = U(:,1:k);
+z = Ureduce'*x;
+
+Dimensionality Reduction
+Reconstruction from compressed representation
+Machine Learning
+
+Reconstruction from compressed representation
+ x_approx = Ureduce * z  (an approximation to the original x in R^n)
+
+Dimensionality Reduction
+Choosing the number of principal components
+Machine Learning
+
+Choosing k (number of principal components)
+Average squared projection error: (1/m) sum_i ||x(i) - x_approx(i)||^2
+Total variation in the data: (1/m) sum_i ||x(i)||^2
+Typically, choose k to be the smallest value so that
+ [average squared projection error] / [total variation] <= 0.01 (1%),
+i.e., "99% of variance is retained."
+
+Choosing k (number of principal components)
+Algorithm: try PCA with k = 1, 2, ...; compute the ratio above; check if it is <= 0.01.
+More efficiently, with [U,S,V] = svd(Sigma), pick the smallest value of k for which
+ sum_{i=1}^k S_ii / sum_{i=1}^n S_ii >= 0.99
+(99% of variance retained).
+
+Dimensionality Reduction
+Advice for applying PCA
+Machine Learning
+
+Supervised learning speedup
+Extract inputs: take the inputs x(1), ..., x(m) as an unlabeled dataset and map them,
+via PCA, to z(1), ..., z(m).
+New training set: (z(1), y(1)), ..., (z(m), y(m)).
+Note: the mapping from x to z should be defined by running PCA only on the training set.
+This mapping can then be applied as well to the examples in the cross validation and test
+sets.
+
+Application of PCA
+- Compression: reduce memory/disk needed to store data; speed up learning algorithm.
+- Visualization.
+
+Bad use of PCA: to prevent overfitting
+Use z(i) instead of x(i) to reduce the number of features. Thus, fewer features, less
+likely to overfit. This might work OK, but isn't a good way to address overfitting:
+use regularization instead.
+
+PCA is sometimes used where it shouldn't be
+Design of ML system:
+- Get training set
+- Run PCA to reduce x(i) in dimension to get z(i)
+- Train logistic regression on (z(i), y(i))
+- Test on test set: map x to z, run the hypothesis on z.
+How about doing the whole thing without using PCA?
+Before implementing PCA, first try running whatever you want to do with the original/raw
+data x. Only if that doesn't do what you want, then implement PCA and consider using z.
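+
+(Aside, not from the original slides: an Octave sketch tying the pieces above together,
+choosing k by the 99%-of-variance rule. X is assumed to be an m x n mean-normalized data
+matrix with one example per row.)
+
+[m, n] = size(X);
+Sigma = (1/m) * X' * X;             % covariance matrix
+[U, S, V] = svd(Sigma);
+s = diag(S);                        % variances along the principal components
+for k = 1:n,
+  if sum(s(1:k)) / sum(s) >= 0.99,  % smallest k retaining 99% of the variance
+    break;
+  end
+end
+Ureduce = U(:, 1:k);
+z = X * Ureduce;                    % m x k compressed representation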
+
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture15.txt b/Lectures/mlclass/Lecture15.txt
new file mode 100644
index 0000000..ccfec25
--- /dev/null
+++ b/Lectures/mlclass/Lecture15.txt
@@ -0,0 +1,359 @@
+Anomaly detection
+Problem motivation
+Machine Learning
+
+Anomaly detection example
+Aircraft engine features: x1 = heat generated, x2 = vibration intensity, ...
+Dataset: {x(1), ..., x(m)}
+New engine: x_test
+[figure: vibration vs. heat; is x_test anomalous?]
+
+Density estimation
+Dataset: {x(1), ..., x(m)}
+Is x_test anomalous? Model p(x); flag an anomaly if p(x_test) < epsilon.
+[figure: vibration vs. heat with density contours]
+
+Anomaly detection example
+Fraud detection:
+ x(i) = features of user i's activities
+ Model p(x) from data.
+ Identify unusual users by checking which have p(x) < epsilon.
+Manufacturing.
+Monitoring computers in a data center:
+ x(i) = features of machine i
+ x1 = memory use, x2 = number of disk accesses/sec, x3 = CPU load,
+ x4 = CPU load/network traffic, ...
+
+Anomaly detection
+Gaussian distribution
+Machine Learning
+
+Gaussian (Normal) distribution
+Say x in R. If x is Gaussian-distributed with mean mu and variance sigma^2:
+ x ~ N(mu, sigma^2)
+
+Gaussian distribution example
+[figures: Gaussian densities for several values of mu and sigma^2]
+
+Parameter estimation
+Dataset: {x(1), ..., x(m)}, x(i) in R
+ mu = (1/m) sum_{i=1}^m x(i),  sigma^2 = (1/m) sum_{i=1}^m (x(i) - mu)^2
+
+Anomaly detection
+Algorithm
+Machine Learning
+
+Density estimation
+Training set: {x(1), ..., x(m)}; each example is x in R^n.
+ p(x) = p(x1; mu_1, sigma_1^2) p(x2; mu_2, sigma_2^2) ... p(xn; mu_n, sigma_n^2)
+
+Anomaly detection algorithm
+1. Choose features x_j that you think might be indicative of anomalous examples.
+2. Fit parameters mu_1, ..., mu_n, sigma_1^2, ..., sigma_n^2.
+3. Given a new example x, compute p(x); anomaly if p(x) < epsilon.
+
+Anomaly detection example
+[figure: contours of p(x) on a 2D dataset; low-probability points are flagged]
+
+Anomaly detection
+Developing and evaluating an anomaly detection system
+Machine Learning
+
+The importance of real-number evaluation
+When developing a learning algorithm (choosing features, etc.), making decisions is much
+easier if we have a way of evaluating our learning algorithm.
+Assume we have some labeled data, of anomalous and non-anomalous examples
+(y = 0 if normal, y = 1 if anomalous).
+Training set: x(1), ..., x(m) (assume normal examples/not anomalous)
+Cross validation set: (x_cv(1), y_cv(1)), ..., (x_cv(m_cv), y_cv(m_cv))
+Test set: (x_test(1), y_test(1)), ..., (x_test(m_test), y_test(m_test))
+
+Aircraft engines motivating example
+10000 good (normal) engines; 20 flawed engines (anomalous).
+Training set: 6000 good engines.
+CV: 2000 good engines (y = 0), 10 anomalous (y = 1).
+Test: 2000 good engines (y = 0), 10 anomalous (y = 1).
+Alternative:
+Training set: 6000 good engines.
+CV: 4000 good engines (y = 0), 10 anomalous (y = 1).
+Test: 4000 good engines (y = 0), 10 anomalous (y = 1).
+
+Algorithm evaluation
+Fit model p(x) on the training set.
+On a cross validation/test example x, predict y = 1 if p(x) < epsilon, else y = 0.
+Possible evaluation metrics:
+- True positive, false positive, false negative, true negative
+- Precision/Recall
+- F1-score
+Can also use the cross validation set to choose the parameter epsilon.
+
+Anomaly detection
+Anomaly detection vs. supervised learning
+Machine Learning
+
+Anomaly detection vs. supervised learning
+Anomaly detection:
+- Very small number of positive examples (y = 1); 0-20 is common.
+- Large number of negative (y = 0) examples.
+- Many different "types" of anomalies. Hard for any algorithm to learn from positive
+  examples what the anomalies look like; future anomalies may look nothing like any of
+  the anomalous examples we've seen so far.
+Supervised learning:
+- Large number of positive and negative examples.
+- Enough positive examples for the algorithm to get a sense of what positive examples
+  are like; future positive examples are likely to be similar to ones in the training set.
+
+Anomaly detection vs. supervised learning
+Anomaly detection: fraud detection; manufacturing (e.g., aircraft engines); monitoring
+machines in a data center.
+Supervised learning: email spam classification; weather prediction (sunny/rainy/etc.);
+cancer classification.
+
+Anomaly detection
+Choosing what features to use
+Machine Learning
+
+Non-Gaussian features
+[figure: histogram of a feature before and after a transformation to make it more Gaussian]
+
+Error analysis for anomaly detection
+Want p(x) large for normal examples x, and p(x) small for anomalous examples x.
+Most common problem: p(x) is comparable (say, both large) for normal and anomalous examples.
+
+Monitoring computers in a data center
+Choose features that might take on unusually large or small values in the event of an anomaly.
+ x1 = memory use of computer
+ x2 = number of disk accesses/sec
+ x3 = CPU load
+ x4 = network traffic
+
+Anomaly detection
+Multivariate Gaussian distribution
+Machine Learning
+
+Motivating example: monitoring machines in a data center
+[figure: CPU load vs. memory use; an anomaly that looks normal in each feature separately]
+
+Multivariate Gaussian (Normal) distribution
+x in R^n. Don't model p(x1), p(x2), etc. separately. Model p(x) all in one go.
+Parameters: mu in R^n, Sigma in R^{n x n} (covariance matrix)
+
+Multivariate Gaussian (Normal) examples
+[figures: contours of the multivariate Gaussian for various choices of mu and Sigma]
+
+Anomaly detection
+Anomaly detection using the multivariate Gaussian distribution
+Machine Learning
+
+Multivariate Gaussian (Normal) distribution
+Parameters mu, Sigma.
+Parameter fitting: given training set {x(1), ..., x(m)},
+ mu = (1/m) sum_{i=1}^m x(i),  Sigma = (1/m) sum_{i=1}^m (x(i) - mu)(x(i) - mu)'
+
+Anomaly detection with the multivariate Gaussian
+1. Fit model p(x) by setting mu and Sigma as above.
+2. Given a new example x, compute p(x); flag an anomaly if p(x) < epsilon.
+
+Relationship to original model
+Original model: p(x) = p(x1; mu_1, sigma_1^2) x ... x p(xn; mu_n, sigma_n^2).
+Corresponds to the multivariate Gaussian where Sigma is diagonal with entries
+sigma_1^2, ..., sigma_n^2.
+
+Original model vs. multivariate Gaussian
+Original model:
+- Manually create features to capture anomalies where x1, x2, ... take unusual
+  combinations of values.
+- Computationally cheaper (alternatively, scales better to large n).
+- OK even if m (training set size) is small.
+Multivariate Gaussian:
+- Automatically captures correlations between features.
+- Computationally more expensive.
+- Must have m > n, or else Sigma is non-invertible.
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture16.txt b/Lectures/mlclass/Lecture16.txt
new file mode 100644
index 0000000..6b0c94c
--- /dev/null
+++ b/Lectures/mlclass/Lecture16.txt
@@ -0,0 +1,615 @@
+Recommender Systems
+Problem formulation
+Machine Learning
+
+Example: Predicting movie ratings
+User rates movies using one to five stars.
+Movies: Love at last; Romance forever; Cute puppies of love; Nonstop car chases;
+Swords vs. karate. Users: Alice (1), Bob (2), Carol (3), Dave (4).
+n_u = no. of users
+n_m = no. of movies
+r(i, j) = 1 if user j has rated movie i
+y(i, j) = rating given by user j to movie i (defined only if r(i, j) = 1)
+
+Recommender Systems
+Content-based recommendations
+Machine Learning
+
+Content-based recommender systems
+Movie                 Alice (1)  Bob (2)  Carol (3)  Dave (4)  x1 (romance)  x2 (action)
+Love at last          5          5        0          0         0.9           0
+Romance forever       5          ?        ?          0         1.0           0.01
+Cute puppies of love  ?          4        0          ?         0.99          0
+Nonstop car chases    0          0        5          4         0.1           1.0
+Swords vs. karate     0          0        5          ?         0             0.9
+For each user j, learn a parameter theta(j) in R^3. Predict user j as rating movie i
+with (theta(j))' x(i) stars.
+
+Problem formulation
+r(i, j) = 1 if user j has rated movie i (0 otherwise)
+y(i, j) = rating by user j on movie i (if defined)
+theta(j) = parameter vector for user j
+x(i) = feature vector for movie i
+For user j, movie i, predicted rating: (theta(j))' x(i)
+m(j) = no. of movies rated by user j.
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture16.txt b/Lectures/mlclass/Lecture16.txt
new file mode 100644
index 0000000..6b0c94c
--- /dev/null
+++ b/Lectures/mlclass/Lecture16.txt
@@ -0,0 +1,615 @@
+Recommender Systems
+Problem formulation
+Machine Learning
+
+Example: Predicting movie ratings
+User rates movies using one to five stars
+Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)
+Love at last
+Romance forever
+Cute puppies of love
+Nonstop car chases
+Swords vs. karate
+n_u = no. of users
+n_m = no. of movies
+r(i,j) = 1 if user j has rated movie i
+y(i,j) = rating given by user j to movie i (defined only if r(i,j) = 1)
+
+Recommender Systems
+Content-based recommendations
+Machine Learning
+
+Content-based recommender systems
+Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  x1 (romance)  x2 (action)
+Love at last               5         5        0          0         0.9           0
+Romance forever            5         ?        ?          0         1.0           0.01
+Cute puppies of love       ?         4        0          ?         0.99          0
+Nonstop car chases         0         0        5          4         0.1           1.0
+Swords vs. karate          0         0        5          ?         0             0.9
+For each user j, learn a parameter θ(j) ∈ R^3. Predict user j as
+rating movie i with (θ(j))ᵀ x(i) stars.
+
+Problem formulation
+r(i,j) = 1 if user j has rated movie i (0 otherwise)
+y(i,j) = rating by user j on movie i (if defined)
+θ(j) = parameter vector for user j
+x(i) = feature vector for movie i
+For user j, movie i, predicted rating: (θ(j))ᵀ x(i)
+m(j) = no. of movies rated by user j
+To learn θ(j):
+  min over θ(j) of (1/(2m(j))) sum_{i: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j))²
+                 + (λ/(2m(j))) sum_{k=1..n} (θ_k(j))²
+
+Optimization objective:
+Dropping the constant 1/m(j) factor, to learn θ(j) (parameter for user j):
+  min over θ(j) of (1/2) sum_{i: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j))²
+                 + (λ/2) sum_{k=1..n} (θ_k(j))²
+To learn θ(1), ..., θ(n_u): sum the above over all users j = 1..n_u.
+
+Optimization algorithm:
+Gradient descent update:
+  θ_k(j) := θ_k(j) − α sum_{i: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j)) x_k(i)   (for k = 0)
+  θ_k(j) := θ_k(j) − α ( sum_{i: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j)) x_k(i) + λ θ_k(j) )   (for k ≠ 0)
+
+Recommender Systems
+Collaborative filtering
+Machine Learning
+
+Problem motivation
+(The ratings table above, with the feature values x1, x2 given.)
+
+Problem motivation
+(The same ratings table, but with every movie's x1 (romance) and
+x2 (action) feature values unknown, shown as "?".)
+
+Optimization algorithm
+Given x(1), ..., x(n_m), to learn θ(j):
+  min over θ(j) of (1/2) sum_{i: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j))²
+                 + (λ/2) sum_{k=1..n} (θ_k(j))²
+Given θ(1), ..., θ(n_u), to learn x(i):
+  min over x(i) of (1/2) sum_{j: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j))²
+                 + (λ/2) sum_{k=1..n} (x_k(i))²
+
+Collaborative filtering
+Given x(1), ..., x(n_m) (and movie ratings), can estimate θ(1), ..., θ(n_u).
+Given θ(1), ..., θ(n_u), can estimate x(1), ..., x(n_m).
+
+Recommender Systems
+Collaborative filtering algorithm
+Machine Learning
+
+Collaborative filtering optimization objective
+Given x(1), ..., x(n_m), estimate θ(1), ..., θ(n_u).
+Given θ(1), ..., θ(n_u), estimate x(1), ..., x(n_m).
+Minimizing x(1), ..., x(n_m) and θ(1), ..., θ(n_u) simultaneously:
+J(x(1),...,x(n_m), θ(1),...,θ(n_u)) =
+  (1/2) sum_{(i,j): r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j))²
+  + (λ/2) sum_{i=1..n_m} sum_{k=1..n} (x_k(i))²
+  + (λ/2) sum_{j=1..n_u} sum_{k=1..n} (θ_k(j))²
+
+Collaborative filtering algorithm
+1. Initialize x(1),...,x(n_m), θ(1),...,θ(n_u) to small random values.
+2. Minimize J(x(1),...,x(n_m), θ(1),...,θ(n_u)) using gradient descent
+   (or an advanced optimization algorithm). E.g. for every
+   j = 1..n_u, i = 1..n_m:
+   x_k(i) := x_k(i) − α ( sum_{j: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j)) θ_k(j) + λ x_k(i) )
+   θ_k(j) := θ_k(j) − α ( sum_{i: r(i,j)=1} ((θ(j))ᵀ x(i) − y(i,j)) x_k(i) + λ θ_k(j) )
+3. For a user with parameters θ and a movie with (learned) features x,
+   predict a star rating of θᵀx.
+
+Recommender Systems
+Vectorization: Low rank matrix factorization
+Machine Learning
+
+Collaborative filtering
+Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)
+Love at last               5         5        0          0
+Romance forever            5         ?        ?          0
+Cute puppies of love       ?         4        0          ?
+Nonstop car chases         0         0        5          4
+Swords vs. karate          0         0        5          ?
+
+Collaborative filtering
+Predicted ratings: the matrix with (i,j) entry (θ(j))ᵀ x(i), i.e.
+X Θᵀ, where row i of X is (x(i))ᵀ and row j of Θ is (θ(j))ᵀ.
+This is "low rank matrix factorization."
+
+Finding related movies
+For each product i, we learn a feature vector x(i) ∈ R^n.
+How to find movies j related to movie i?
+Small ||x(i) − x(j)|| means movie j and movie i are "similar".
+5 most similar movies to movie i:
+Find the 5 movies j with the smallest ||x(i) − x(j)||.
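+A short Octave sketch of the vectorized prediction and the
+related-movies lookup above (X = n_m×n movie features, Theta = n_u×n
+user parameters; all names are illustrative assumptions):
+
+% Predicted ratings for every (movie, user) pair at once:
+Ypred = X * Theta';                      % entry (i,j) = theta(j)' * x(i)
+
+% 5 movies most similar to movie i, by feature-vector distance:
+i = 1;
+d = sum(bsxfun(@minus, X, X(i,:)).^2, 2); % squared ||x(i) - x(j)||^2
+d(i) = Inf;                               % exclude the movie itself
+[~, idx] = sort(d);
+most_similar = idx(1:5);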
+Recommender Systems
+Implementational detail: Mean normalization
+Machine Learning
+
+Users who have not rated any movies
+Movie                  Alice (1)  Bob (2)  Carol (3)  Dave (4)  Eve (5)
+Love at last               5         5        0          0        ?
+Romance forever            5         ?        ?          0        ?
+Cute puppies of love       ?         4        0          ?        ?
+Nonstop car chases         0         0        5          4        ?
+Swords vs. karate          0         0        5          ?        ?
+
+Mean Normalization:
+Compute the mean rating μ_i of each movie (over the users who rated
+it) and subtract it from each rating before learning.
+For user j, on movie i predict: (θ(j))ᵀ x(i) + μ_i
+User 5 (Eve): with no ratings, the regularizer drives θ(5) to 0, so
+her predicted rating for movie i defaults to the movie's mean μ_i.
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture17.txt b/Lectures/mlclass/Lecture17.txt
new file mode 100644
index 0000000..b360bf8
--- /dev/null
+++ b/Lectures/mlclass/Lecture17.txt
@@ -0,0 +1,299 @@
+Large scale machine learning
+Learning with large datasets
+Machine Learning
+
+Machine learning and data
+Classify between confusable words.
+E.g., {to, two, too}, {then, than}.
+"For breakfast I ate _____ eggs."
+[figure: accuracy vs. training set size (millions) for several algorithms]
+"It's not who has the best algorithm that wins.
+ It's who has the most data."
+[Figure from Banko and Brill, 2001]
+
+Learning with large datasets
+[figures: two plots of error vs. m (training set size)]
+
+Large scale machine learning
+Stochastic gradient descent
+Machine Learning
+
+Linear regression with gradient descent
+h_θ(x) = sum_{j=0..n} θ_j x_j
+J_train(θ) = (1/(2m)) sum_{i=1..m} (h_θ(x(i)) − y(i))²
+Repeat {
+  θ_j := θ_j − α (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i)) x_j(i)
+  (for every j = 0, ..., n)
+}
+
+Batch gradient descent:
+Repeat {
+  θ_j := θ_j − α (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i)) x_j(i)
+  (for every j = 0, ..., n)
+}
+
+Stochastic gradient descent:
+cost(θ, (x(i), y(i))) = (1/2)(h_θ(x(i)) − y(i))²
+J_train(θ) = (1/m) sum_{i=1..m} cost(θ, (x(i), y(i)))
+
+Stochastic gradient descent
+1. Randomly shuffle (reorder) training examples
+2. Repeat {
+     for i := 1, ..., m {
+       θ_j := θ_j − α (h_θ(x(i)) − y(i)) x_j(i)
+       (for every j = 0, ..., n)
+     }
+   }
+
+Large scale machine learning
+Mini-batch gradient descent
+Machine Learning
+
+Mini-batch gradient descent
+Batch gradient descent: Use all m examples in each iteration
+Stochastic gradient descent: Use 1 example in each iteration
+Mini-batch gradient descent: Use b examples in each iteration
+
+Mini-batch gradient descent
+Say b = 10, m = 1000.
+Repeat {
+  for i = 1, 11, 21, 31, ..., 991 {
+    θ_j := θ_j − α (1/10) sum_{k=i..i+9} (h_θ(x(k)) − y(k)) x_j(k)
+    (for every j = 0, ..., n)
+  }
+}
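+A minimal Octave sketch of one pass of the mini-batch update above
+(X = m×(n+1) design matrix, y = m×1 targets, theta initialized; names
+and the assumption that b divides m are for illustration only):
+
+b = 10; m = size(X, 1); alpha = 0.01;
+for i = 1:b:m
+    Xb = X(i:i+b-1, :);                  % next mini-batch of b examples
+    yb = y(i:i+b-1);
+    grad = (1/b) * Xb' * (Xb * theta - yb);
+    theta = theta - alpha * grad;        % one update per mini-batch
+end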
+Large scale machine learning
+Stochastic gradient descent convergence
+Machine Learning
+
+Checking for convergence
+Batch gradient descent:
+Plot J_train(θ) as a function of the number of iterations of gradient
+descent, with J_train(θ) = (1/(2m)) sum_{i=1..m} (h_θ(x(i)) − y(i))².
+Stochastic gradient descent:
+During learning, compute cost(θ, (x(i), y(i))) before updating θ
+using (x(i), y(i)).
+Every 1000 iterations (say), plot cost(θ, (x(i), y(i))) averaged
+over the last 1000 examples processed by the algorithm.
+
+Checking for convergence
+Plot cost(θ, (x(i), y(i))), averaged over the last 1000 (say) examples
+[figures: four example plots of the averaged cost vs. no. of
+iterations - converging, noisy, flat, and diverging cases]
+
+Stochastic gradient descent
+1. Randomly shuffle dataset.
+2. Repeat {
+     for i := 1, ..., m {
+       θ_j := θ_j − α (h_θ(x(i)) − y(i)) x_j(i)   (for j = 0, ..., n)
+     }
+   }
+Learning rate α is typically held constant. Can slowly decrease α
+over time if we want θ to converge.
+(E.g. α = const1 / (iterationNumber + const2))
+
+Large scale machine learning
+Online learning
+Machine Learning
+
+Online learning
+Shipping service website where user comes, specifies origin and
+destination, you offer to ship their package for some asking price,
+and users sometimes choose to use your shipping service (y = 1),
+sometimes not (y = 0).
+Features x capture properties of user, of origin/destination and
+asking price. We want to learn p(y = 1 | x; θ) to optimize price.
+
+Other online learning example:
+Product search (learning to search)
+User searches for "Android phone 1080p camera"
+Have 100 phones in store. Will return 10 results.
+x = features of phone, how many words in user query match name of
+phone, how many words in query match description of phone, etc.
+y = 1 if user clicks on link, y = 0 otherwise.
+Learn p(y = 1 | x; θ).
+Use to show user the 10 phones they're most likely to click on.
+Other examples: Choosing special offers to show user; customized
+selection of news articles; product recommendation; ...
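+A sketch of the per-example online update implied above, in Octave
+(logistic-regression form; function and variable names are assumed):
+
+% Process one arriving (x, y) pair, update theta, then discard the pair.
+function theta = online_update(theta, x, y, alpha)
+  h = 1 / (1 + exp(-theta' * x));        % predicted p(y = 1 | x; theta)
+  theta = theta - alpha * (h - y) * x;   % one stochastic gradient step
+end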
+Large scale machine learning
+Map-reduce and data parallelism
+Machine Learning
+
+Map-reduce
+Batch gradient descent (say m = 400):
+θ_j := θ_j − α (1/400) sum_{i=1..400} (h_θ(x(i)) − y(i)) x_j(i)
+Machine 1: Use (x(1), y(1)), ..., (x(100), y(100)); compute temp_j(1).
+Machine 2: Use (x(101), y(101)), ..., (x(200), y(200)); compute temp_j(2).
+Machine 3: Use (x(201), y(201)), ..., (x(300), y(300)); compute temp_j(3).
+Machine 4: Use (x(301), y(301)), ..., (x(400), y(400)); compute temp_j(4).
+Combine: θ_j := θ_j − α (1/400)(temp_j(1) + temp_j(2) + temp_j(3) + temp_j(4))
+[Jeffrey Dean and Sanjay Ghemawat]
+
+Map-reduce
+Training set split across Computer 1, Computer 2, Computer 3,
+Computer 4; combine results.
+[http://openclipart.org/detail/17924/computer-by-aj]
+
+Map-reduce and summation over the training set
+Many learning algorithms can be expressed as computing sums of
+functions over the training set.
+E.g. for advanced optimization, with logistic regression, need:
+J_train(θ) = −(1/m) sum_{i=1..m} [ y(i) log h_θ(x(i))
+             + (1 − y(i)) log(1 − h_θ(x(i))) ]
+∂J_train(θ)/∂θ_j = (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i)) x_j(i)
+
+Multi-core machines
+Training set split across Core 1, Core 2, Core 3, Core 4; combine
+results.
+[http://openclipart.org/detail/100267/cpu-(central-processing-unit)-by-ivak-100267]
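+A toy Octave sketch of the split-and-combine pattern above, run
+serially as a stand-in for four machines/cores (X, y, theta, alpha
+names assumed; m assumed divisible by 4):
+
+m = size(X, 1);
+chunk = m / 4;
+temp = zeros(length(theta), 4);
+for k = 1:4                              % each "machine" sums its quarter
+    idx = (k-1)*chunk + 1 : k*chunk;
+    Xk = X(idx, :); yk = y(idx);
+    temp(:, k) = Xk' * (Xk * theta - yk);
+end
+theta = theta - alpha * (1/m) * sum(temp, 2);  % combine partial sums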
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture18.txt b/Lectures/mlclass/Lecture18.txt
new file mode 100644
index 0000000..1dee4ed
--- /dev/null
+++ b/Lectures/mlclass/Lecture18.txt
@@ -0,0 +1,332 @@
+Application example: Photo OCR
+Problem description and pipeline
+Machine Learning
+
+The Photo OCR problem
+[figures: photographs containing text in the scene]
+
+Photo OCR pipeline
+1. Text detection
+2. Character segmentation
+3. Character classification   A N T
+
+Photo OCR pipeline
+Image → Text detection → Character segmentation → Character recognition
+
+Application example: Photo OCR
+Sliding windows
+Machine Learning
+
+Text detection / Pedestrian detection
+[figures: a street scene with text regions and pedestrians marked]
+
+Supervised learning for pedestrian detection
+x = pixels in 82x36 image patches
+Positive examples (y = 1) / Negative examples (y = 0)
+[figures: example patches]
+
+Sliding window detection
+[figures: a detection window stepped across the image at multiple
+positions and scales]
+
+Text detection
+[figures: classifier response map and expanded text regions]
+Positive examples / Negative examples
+[David Wu]
+
+1D Sliding window for character segmentation
+Positive examples / Negative examples
+[figures: example patches]
+
+(Pipeline recap: text detection → character segmentation → character
+classification.)
+
+Application example: Photo OCR
+Getting lots of data: Artificial data synthesis
+Machine Learning
+
+Character recognition
+[figures: character images - A, I, T, N, Q, A]
+
+Artificial data synthesis for photo OCR
+Abcdefg (rendered in various fonts) vs. real data
+[Adam Coates and Tao Wang]
+
+Artificial data synthesis for photo OCR
+Real data / Synthetic data
+[Adam Coates and Tao Wang]
+
+Synthesizing data by introducing distortions
+[figures: a character image and several warped variants]
+[Adam Coates and Tao Wang]
+
+Synthesizing data by introducing distortions: Speech recognition
+Original audio;
+Audio on bad cellphone connection;
+Noisy background: Crowd;
+Noisy background: Machinery.
+[www.pdsounds.org]
+
+Synthesizing data by introducing distortions
+Distortion introduced should be representative of the type of
+noise/distortions in the test set.
+Audio: background noise, bad cellphone connection.
+Usually does not help to add purely random/meaningless noise to your
+data, e.g. adding random noise to x_i = intensity (brightness) of
+pixel i.
+[Adam Coates and Tao Wang]
+
+Discussion on getting more data
+1. Make sure you have a low bias classifier before expending the
+   effort. (Plot learning curves.) E.g. keep increasing the number of
+   features/number of hidden units in neural network until you have a
+   low bias classifier.
+2. "How much work would it be to get 10x as much data as we currently
+   have?"
+   - Artificial data synthesis
+   - Collect/label it yourself
+   - "Crowd source" (e.g. Amazon Mechanical Turk)
+
+Application example: Photo OCR
+Ceiling analysis: What part of the pipeline to work on next
+Machine Learning
+
+Estimating the errors due to each component (ceiling analysis)
+Image → Text detection → Character segmentation → Character recognition
+What part of the pipeline should you spend the most time trying to
+improve?
+Component                    Accuracy
+Overall system               72%
+Text detection               89%
+Character segmentation       90%
+Character recognition        100%
+(Each row gives overall accuracy when that component and all earlier
+ones are replaced with ground truth, so the headroom from perfecting
+each stage is 17%, 1%, and 10% respectively.)
+
+Another ceiling analysis example
+Face recognition from images (artificial example)
+Pipeline: Camera image → Preprocess (remove background) → Face
+detection → {Eyes segmentation, Nose segmentation, Mouth segmentation}
+→ Logistic regression → Label
+Component                        Accuracy
+Overall system                   85%
+Preprocess (remove background)   85.1%
+Face detection                   91%
+Eyes segmentation                95%
+Nose segmentation                96%
+Mouth segmentation               97%
+Logistic regression              100%
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture2.txt b/Lectures/mlclass/Lecture2.txt
new file mode 100644
index 0000000..7e14e62
--- /dev/null
+++ b/Lectures/mlclass/Lecture2.txt
@@ -0,0 +1,568 @@
+Linear regression with one variable
+Model representation
+Machine Learning
+
+Housing Prices (Portland, OR)
+[figure: price (in 1000s of dollars) vs. size (feet²) scatter plot]
+Supervised Learning: given the "right answer" for each example in
+the data.
+Regression Problem: predict real-valued output.
+
+Training set of housing prices (Portland, OR)
+Size in feet² (x)    Price ($) in 1000's (y)
+2104                 460
+1416                 232
+1534                 315
+852                  178
+...                  ...
+Notation:
+  m = Number of training examples
+  x's = "input" variable / features
+  y's = "output" variable / "target" variable
+
+How do we represent h?
+Training Set → Learning Algorithm → h
+Size of house → h → Estimated price
+h_θ(x) = θ0 + θ1 x
+Linear regression with one variable.
+Univariate linear regression.
+Linear regression with one variable
+Cost function
+Machine Learning
+
+Training Set
+Size in feet² (x)    Price ($) in 1000's (y)
+2104                 460
+1416                 232
+1534                 315
+852                  178
+...                  ...
+Hypothesis: h_θ(x) = θ0 + θ1 x
+θ0, θ1: parameters
+How to choose θ's?
+
+[figures: three lines for three different choices of (θ0, θ1)]
+
+Idea: Choose θ0, θ1 so that h_θ(x) is close to y for our training
+examples (x, y).
+Cost function: J(θ0, θ1) = (1/(2m)) sum_{i=1..m} (h_θ(x(i)) − y(i))²
+Goal: minimize over θ0, θ1 of J(θ0, θ1)
+
+Linear regression with one variable
+Cost function intuition I
+Machine Learning
+
+Hypothesis: h_θ(x) = θ0 + θ1 x
+Simplified: h_θ(x) = θ1 x  (θ0 = 0)
+Parameters: θ1
+Cost Function: J(θ1) = (1/(2m)) sum_{i=1..m} (h_θ(x(i)) − y(i))²
+Goal: minimize over θ1 of J(θ1)
+
+[figures: three paired plots of h_θ(x) (for fixed θ1, a function of x)
+and J(θ1) (a function of the parameter θ1)]
+
+Linear regression with one variable
+Cost function intuition II
+Machine Learning
+
+Hypothesis, Parameters, Cost Function, Goal: as above, now with both
+θ0 and θ1.
+[figures: h_θ(x) fits of price ($ in 1000's) vs. size in feet² (for
+fixed θ0, θ1, functions of x) alongside contour plots of J(θ0, θ1)
+(a function of the parameters)]
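+Before moving on to gradient descent, here is a minimal Octave sketch
+of evaluating the cost J(θ0, θ1) defined above (x, y are m×1 data
+vectors; names assumed):
+
+function J = cost(theta0, theta1, x, y)
+  m = length(y);
+  h = theta0 + theta1 * x;               % hypothesis on every example
+  J = sum((h - y).^2) / (2*m);           % squared-error cost
+end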
+Linear regression with one variable
+Gradient descent
+Machine Learning
+
+Have some function J(θ0, θ1).
+Want min over θ0, θ1 of J(θ0, θ1).
+Outline:
+• Start with some θ0, θ1
+• Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up
+  at a minimum
+
+[figures: surface plots of J(θ0, θ1) over (θ0, θ1), showing descent
+paths reaching different local minima]
+
+Gradient descent algorithm
+Repeat until convergence {
+  θ_j := θ_j − α ∂J(θ0, θ1)/∂θ_j   (for j = 0 and j = 1)
+}
+Correct: Simultaneous update
+  temp0 := θ0 − α ∂J(θ0, θ1)/∂θ0
+  temp1 := θ1 − α ∂J(θ0, θ1)/∂θ1
+  θ0 := temp0
+  θ1 := temp1
+Incorrect:
+  temp0 := θ0 − α ∂J(θ0, θ1)/∂θ0
+  θ0 := temp0
+  temp1 := θ1 − α ∂J(θ0, θ1)/∂θ1
+  θ1 := temp1
+
+Linear regression with one variable
+Gradient descent intuition
+Machine Learning
+
+Gradient descent algorithm
+θ_j := θ_j − α ∂J/∂θ_j
+[figure: J(θ1) with the sign of the derivative driving θ1 toward the
+minimum from either side]
+
+If α is too small, gradient descent can be slow.
+If α is too large, gradient descent can overshoot the minimum. It may
+fail to converge, or even diverge.
+
+At a local optimum the derivative is zero, so the update leaves the
+current value of θ1 unchanged.
+
+Gradient descent can converge to a local minimum, even with the
+learning rate α fixed.
+As we approach a local minimum, gradient descent will automatically
+take smaller steps. So, no need to decrease α over time.
+
+Linear regression with one variable
+Gradient descent for linear regression
+Machine Learning
+
+Gradient descent algorithm + Linear Regression Model
+h_θ(x) = θ0 + θ1 x
+J(θ0, θ1) = (1/(2m)) sum_{i=1..m} (h_θ(x(i)) − y(i))²
+∂J/∂θ0 = (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i))
+∂J/∂θ1 = (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i)) x(i)
+
+Gradient descent algorithm
+Repeat until convergence {
+  θ0 := θ0 − α (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i))
+  θ1 := θ1 − α (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i)) x(i)
+}  (update θ0 and θ1 simultaneously)
+
+[figures: surface and contour plots of J(θ0, θ1); successive gradient
+descent steps shown as pairs of h_θ(x) fits (for fixed θ0, θ1,
+functions of x) and points on the contour plot (as a function of the
+parameters)]
+
+"Batch" Gradient Descent
+"Batch": Each step of gradient descent uses all the training examples.
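+A runnable Octave sketch of batch gradient descent for the univariate
+model above (x, y are m×1 data vectors; α and the iteration count are
+assumed values):
+
+m = length(y); alpha = 0.01; theta0 = 0; theta1 = 0;
+for iter = 1:1500
+    h = theta0 + theta1 * x;
+    temp0 = theta0 - alpha * (1/m) * sum(h - y);        % simultaneous
+    temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x); % update
+    theta0 = temp0; theta1 = temp1;
+end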
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture3.txt b/Lectures/mlclass/Lecture3.txt
new file mode 100644
index 0000000..4f74ea0
--- /dev/null
+++ b/Lectures/mlclass/Lecture3.txt
@@ -0,0 +1,200 @@
+Linear Algebra review (optional)
+Matrices and vectors
+Machine Learning
+
+Matrix: Rectangular array of numbers.
+Dimension of matrix: number of rows x number of columns.
+
+Matrix Elements (entries of matrix)
+A_ij = "i, j entry" in the i-th row, j-th column.
+
+Vector: An n x 1 matrix.
+y_i = i-th element.
+1-indexed vs 0-indexed conventions both exist; we use 1-indexed
+vectors unless stated otherwise.
+
+Linear Algebra review (optional)
+Addition and scalar multiplication
+Machine Learning
+
+Matrix Addition: add matrices of the same dimension element-wise.
+Scalar Multiplication: multiply every entry of the matrix by the scalar.
+Combination of Operands: e.g. an expression mixing scalar
+multiplication, addition, and division applies the operations in turn.
+
+Linear Algebra review (optional)
+Matrix-vector multiplication
+Machine Learning
+
+Details:
+A (m x n matrix: m rows, n columns) times x (n x 1 matrix,
+n-dimensional vector) gives y (m-dimensional vector).
+To get y_i, multiply A's i-th row with elements of vector x, and add
+them up.
+
+House sizes: with h_θ(x) = θ0 + θ1 x, the predictions for a whole list
+of house sizes can be computed as a single matrix-vector product (see
+the sketch at the end of this review).
+
+Linear Algebra review (optional)
+Matrix-matrix multiplication
+Machine Learning
+
+Details:
+A (m x n matrix) times B (n x o matrix) gives C (m x o matrix).
+The i-th column of the matrix C is obtained by multiplying A with the
+i-th column of B (for i = 1, 2, ..., o).
+
+House sizes:
+Have 3 competing hypotheses (three different (θ0, θ1) settings).
+Stacking the parameter vectors as columns of a matrix lets one
+matrix-matrix product evaluate all three hypotheses on all houses.
+
+Linear Algebra review (optional)
+Matrix multiplication properties
+Machine Learning
+
+Let A and B be matrices. Then in general, A × B ≠ B × A
+(not commutative).
+Matrix multiplication is associative: (A × B) × C = A × (B × C).
+
+Identity Matrix
+Denoted I (or I_{n×n}).
+Examples of identity matrices: 2 x 2, 3 x 3, 4 x 4.
+For any matrix A: A · I = I · A = A.
+
+Linear Algebra review (optional)
+Inverse and transpose
+Machine Learning
+
+Not all numbers have an inverse.
+Matrix inverse:
+If A is an m x m matrix, and if it has an inverse, then
+A A⁻¹ = A⁻¹ A = I.
+Matrices that don't have an inverse are "singular" or "degenerate".
+
+Matrix Transpose
+Let A be an m x n matrix, and let B = Aᵀ.
+Then B is an n x m matrix, and B_ij = A_ji.
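+The house-sizes example above as a runnable Octave sketch (the sizes
+come from the earlier table; the (θ0, θ1) values are illustrative
+assumptions):

+sizes = [2104; 1416; 1534; 852];         % house sizes in feet^2
+theta = [-40; 0.25];                     % an assumed (theta0, theta1)
+X = [ones(length(sizes), 1), sizes];     % prepend the intercept column
+predictions = X * theta;                 % one matrix-vector product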
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture4.txt b/Lectures/mlclass/Lecture4.txt
new file mode 100644
index 0000000..b8d64d6
--- /dev/null
+++ b/Lectures/mlclass/Lecture4.txt
@@ -0,0 +1,373 @@
+Linear Regression with multiple variables
+Multiple features
+Machine Learning
+
+Multiple features (variables).
+Size (feet²)   Price ($1000)
+2104           460
+1416           232
+1534           315
+852            178
+...            ...
+
+Multiple features (variables).
+Size (feet²)  Number of bedrooms  Number of floors  Age of home (years)  Price ($1000)
+2104          5                   1                 45                   460
+1416          3                   2                 40                   232
+1534          3                   2                 30                   315
+852           2                   1                 36                   178
+...           ...                 ...               ...                  ...
+Notation:
+n = number of features
+x(i) = input (features) of i-th training example.
+x_j(i) = value of feature j in i-th training example.
+
+Hypothesis:
+Previously: h_θ(x) = θ0 + θ1 x
+Now: h_θ(x) = θ0 + θ1 x1 + θ2 x2 + ... + θn xn
+For convenience of notation, define x0 = 1.
+Then h_θ(x) = θᵀx.
+Multivariate linear regression.
+
+Linear Regression with multiple variables
+Gradient descent for multiple variables
+Machine Learning
+
+Hypothesis: h_θ(x) = θᵀx = θ0 x0 + θ1 x1 + ... + θn xn  (x0 = 1)
+Parameters: θ0, θ1, ..., θn
+Cost function:
+J(θ0, ..., θn) = (1/(2m)) sum_{i=1..m} (h_θ(x(i)) − y(i))²
+Gradient descent:
+Repeat {
+  θ_j := θ_j − α ∂J(θ)/∂θ_j
+} (simultaneously update for every j = 0, ..., n)
+
+Gradient Descent
+Previously (n = 1):
+Repeat {
+  θ0 := θ0 − α (1/m) sum_i (h_θ(x(i)) − y(i))
+  θ1 := θ1 − α (1/m) sum_i (h_θ(x(i)) − y(i)) x(i)
+} (simultaneously update θ0, θ1)
+New algorithm (n ≥ 1):
+Repeat {
+  θ_j := θ_j − α (1/m) sum_i (h_θ(x(i)) − y(i)) x_j(i)
+} (simultaneously update θ_j for j = 0, ..., n)
+
+Linear Regression with multiple variables
+Gradient descent in practice I: Feature Scaling
+Machine Learning
+
+Feature Scaling
+Idea: Make sure features are on a similar scale.
+E.g. x1 = size (0-2000 feet²)
+     x2 = number of bedrooms (1-5)
+[figures: elongated vs. round contours of J(θ) before and after
+scaling]
+
+Feature Scaling
+Get every feature into approximately a −1 ≤ x_i ≤ 1 range.
+
+Mean normalization
+Replace x_i with x_i − μ_i to make features have approximately zero
+mean (do not apply to x0 = 1).
+E.g. x_i := (x_i − μ_i) / s_i, where s_i is the feature's range
+(max − min) or its standard deviation.
+
+Linear Regression with multiple variables
+Gradient descent in practice II: Learning rate
+Machine Learning
+
+Gradient descent
+θ_j := θ_j − α ∂J(θ)/∂θ_j
+- "Debugging": How to make sure gradient descent is working correctly.
+- How to choose learning rate α.
+
+Making sure gradient descent is working correctly.
+[figure: J(θ) vs. no. of iterations (0, 100, 200, 300, 400),
+flattening out as it converges]
+Example automatic convergence test:
+Declare convergence if J(θ) decreases by less than some small ε in
+one iteration.
+
+Making sure gradient descent is working correctly.
+[figures: J(θ) increasing or oscillating with no. of iterations]
+Gradient descent not working. Use smaller α.
+- For sufficiently small α, J(θ) should decrease on every iteration.
+- But if α is too small, gradient descent can be slow to converge.
+
+Summary:
+- If α is too small: slow convergence.
+- If α is too large: J(θ) may not decrease on every iteration; may
+  not converge.
+To choose α, try a range of values, e.g. ..., 0.001, 0.01, 0.1, 1, ...
+
+Linear Regression with multiple variables
+Features and polynomial regression
+Machine Learning
+
+Housing prices prediction
+Rather than using two raw features directly, one can create a new
+feature from them (e.g. a single area feature).
+
+Polynomial regression
+[figure: price (y) vs. size (x) with quadratic and cubic fits]
+E.g. fit h_θ(x) = θ0 + θ1 x + θ2 x² (+ θ3 x³) by defining features
+x1 = size, x2 = size², x3 = size³. (Feature scaling becomes important,
+since the powers of size take very different ranges.)
+
+Choice of features
+[figure: price (y) vs. size (x)]
+Other choices of features (e.g. square-root terms) give other
+families of fits.
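+A small Octave sketch of the mean normalization / feature scaling
+described above (X = m×n feature matrix without the intercept column;
+names assumed):
+
+mu = mean(X);                            % per-feature means
+sigma = std(X);                          % per-feature scales
+X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);
+% New examples must be normalized with the same mu and sigma.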
+Linear Regression with multiple variables
+Normal equation
+Machine Learning
+
+Gradient Descent: iterative.
+Normal equation: Method to solve for θ analytically.
+
+Intuition: If 1D (θ ∈ R):
+set the derivative of J(θ) to zero and solve for θ.
+Generalization: set ∂J(θ)/∂θ_j = 0 (for every j); solve for
+θ0, θ1, ..., θn.
+
+Examples: m = 4.
+x0  Size (feet²)  Number of bedrooms  Number of floors  Age of home (years)  Price ($1000)
+1   2104          5                   1                 45                   460
+1   1416          3                   2                 40                   232
+1   1534          3                   2                 30                   315
+1   852           2                   1                 36                   178
+θ = (XᵀX)⁻¹ Xᵀy
+
+m examples (x(1), y(1)), ..., (x(m), y(m)); n features.
+E.g. if x(i) = [1; x1(i)], the design matrix X stacks the (x(i))ᵀ as
+rows and y stacks the targets.
+θ = (XᵀX)⁻¹ Xᵀy
+(XᵀX)⁻¹ is the inverse of matrix XᵀX.
+Octave: pinv(X'*X)*X'*y
+
+m training examples, n features.
+Gradient Descent:
+• Need to choose α.
+• Needs many iterations.
+• Works well even when n is large.
+Normal Equation:
+• No need to choose α.
+• Don't need to iterate.
+• Need to compute (XᵀX)⁻¹.
+• Slow if n is very large.
+
+Linear Regression with multiple variables
+Normal equation and non-invertibility (optional)
+Machine Learning
+
+Normal equation
+θ = (XᵀX)⁻¹ Xᵀy
+- What if XᵀX is non-invertible? (singular/degenerate)
+- Octave: pinv(X'*X)*X'*y (the pseudo-inverse still gives a usable θ)
+
+What if XᵀX is non-invertible?
+• Redundant features (linearly dependent).
+  E.g. x1 = size in feet², x2 = size in m².
+• Too many features (e.g. m ≤ n).
+  - Delete some features, or use regularization.
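+The normal equation applied to the 4-example table above, as an
+Octave sketch:
+
+X = [1 2104 5 1 45;
+     1 1416 3 2 40;
+     1 1534 3 2 30;
+     1  852 2 1 36];
+y = [460; 232; 315; 178];
+theta = pinv(X' * X) * X' * y;           % analytic least-squares solution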
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture6.txt b/Lectures/mlclass/Lecture6.txt
new file mode 100644
index 0000000..2337e18
--- /dev/null
+++ b/Lectures/mlclass/Lecture6.txt
@@ -0,0 +1,338 @@
+Logistic Regression
+Classification
+Machine Learning
+
+Classification
+Email: Spam / Not Spam?
+Online Transactions: Fraudulent (Yes / No)?
+Tumor: Malignant / Benign?
+y ∈ {0, 1}
+0: "Negative Class" (e.g., benign tumor)
+1: "Positive Class" (e.g., malignant tumor)
+
+[figure: Malignant? ((Yes) 1 / (No) 0) vs. Tumor Size, with a linear
+fit and a threshold]
+Threshold classifier output h_θ(x) at 0.5:
+If h_θ(x) ≥ 0.5, predict "y = 1"
+If h_θ(x) < 0.5, predict "y = 0"
+
+Classification: y = 0 or 1.
+h_θ(x) (from linear regression) can be > 1 or < 0.
+Logistic Regression: 0 ≤ h_θ(x) ≤ 1.
+
+Logistic Regression
+Hypothesis Representation
+Machine Learning
+
+Logistic Regression Model
+Want 0 ≤ h_θ(x) ≤ 1.
+h_θ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z)).
+Sigmoid function / Logistic function
+[figure: g(z) rising from 0 to 1, crossing 0.5 at z = 0]
+
+Interpretation of Hypothesis Output
+h_θ(x) = estimated probability that y = 1 on input x.
+Example: If h_θ(x) = 0.7, tell patient that 70% chance of tumor being
+malignant.
+h_θ(x) = p(y = 1 | x; θ),
+"probability that y = 1, given x, parameterized by θ".
+
+Logistic Regression
+Decision boundary
+Machine Learning
+
+Logistic regression
+h_θ(x) = g(θᵀx); g(z) ≥ 0.5 when z ≥ 0.
+Suppose predict "y = 1" if h_θ(x) ≥ 0.5, i.e. θᵀx ≥ 0;
+predict "y = 0" if h_θ(x) < 0.5, i.e. θᵀx < 0.
+
+Decision Boundary
+[figure: two classes in the (x1, x2) plane separated by a line]
+Predict "y = 1" if θᵀx ≥ 0; the line θᵀx = 0 is the decision boundary.
+
+Non-linear decision boundaries
+[figures: circularly separated classes in the (x1, x2) plane]
+With polynomial features (x1², x2², ...), θᵀx = 0 can describe
+non-linear boundaries such as a circle; predict "y = 1" outside it.
+
+Logistic Regression
+Cost function
+Machine Learning
+
+Training set: {(x(1), y(1)), ..., (x(m), y(m))}, m examples,
+x ∈ R^(n+1) with x0 = 1, y ∈ {0, 1},
+h_θ(x) = 1 / (1 + e^(−θᵀx)).
+How to choose parameters θ?
+
+Cost function
+Linear regression: J(θ) = (1/m) sum_i (1/2)(h_θ(x(i)) − y(i))².
+With the sigmoid hypothesis this J(θ) is "non-convex" (many local
+optima); we want a "convex" cost.
+
+Logistic regression cost function
+Cost(h_θ(x), y) = −log(h_θ(x))      if y = 1
+Cost(h_θ(x), y) = −log(1 − h_θ(x))  if y = 0
+[figures: the two cost curves over h_θ(x) ∈ (0, 1)]
+
+Logistic Regression
+Simplified cost function and gradient descent
+Machine Learning
+
+Logistic regression cost function
+J(θ) = (1/m) sum_{i=1..m} Cost(h_θ(x(i)), y(i))
+Cost(h_θ(x), y) = −y log(h_θ(x)) − (1 − y) log(1 − h_θ(x))
+So J(θ) = −(1/m) sum_{i=1..m} [ y(i) log h_θ(x(i))
+          + (1 − y(i)) log(1 − h_θ(x(i))) ]
+To fit parameters θ: min over θ of J(θ).
+To make a prediction given new x:
+Output h_θ(x) = 1 / (1 + e^(−θᵀx)).
+
+Gradient Descent
+Want min over θ of J(θ):
+Repeat {
+  θ_j := θ_j − α (1/m) sum_{i=1..m} (h_θ(x(i)) − y(i)) x_j(i)
+} (simultaneously update all θ_j)
+Algorithm looks identical to linear regression!
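+A vectorized Octave sketch of the cost and gradient just derived
+(X = m×(n+1) design matrix, y ∈ {0,1}; names assumed):
+
+function [J, grad] = logistic_cost(theta, X, y)
+  m = length(y);
+  h = 1 ./ (1 + exp(-X * theta));        % h_theta(x) for every example
+  J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
+  grad = (1/m) * X' * (h - y);           % same form as linear regression
+end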
+Logistic Regression
+Advanced optimization
+Machine Learning
+
+Optimization algorithm
+Cost function J(θ). Want min over θ of J(θ).
+Given θ, we have code that can compute
+- J(θ)
+- ∂J(θ)/∂θ_j   (for j = 0, 1, ..., n)
+Gradient descent:
+Repeat { θ_j := θ_j − α ∂J(θ)/∂θ_j }
+
+Optimization algorithms:
+- Gradient descent
+- Conjugate gradient
+- BFGS
+- L-BFGS
+Advantages:
+- No need to manually pick α
+- Often faster than gradient descent.
+Disadvantages:
+- More complex
+
+Example: minimize J(θ) = (θ1 − 5)² + (θ2 − 5)²
+
+function [jVal, gradient] = costFunction(theta)
+  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
+  gradient = zeros(2,1);
+  gradient(1) = 2*(theta(1)-5);
+  gradient(2) = 2*(theta(2)-5);
+
+options = optimset('GradObj', 'on', 'MaxIter', '100');
+initialTheta = zeros(2,1);
+[optTheta, functionVal, exitFlag] ...
+    = fminunc(@costFunction, initialTheta, options);
+
+theta = [θ0; θ1; ...; θn], unrolled into an Octave vector (1-indexed):
+function [jVal, gradient] = costFunction(theta)
+  jVal = [code to compute J(θ)];
+  gradient(1) = [code to compute ∂J(θ)/∂θ0];
+  gradient(2) = [code to compute ∂J(θ)/∂θ1];
+  ...
+  gradient(n+1) = [code to compute ∂J(θ)/∂θn];
+
+Logistic Regression
+Multi-class classification: One-vs-all
+Machine Learning
+
+Multiclass classification
+Email foldering/tagging: Work, Friends, Family, Hobby
+Medical diagnosis: Not ill, Cold, Flu
+Weather: Sunny, Cloudy, Rain, Snow
+
+Binary classification vs. Multi-class classification
+[figures: two-class and three-class scatter plots in (x1, x2)]
+
+One-vs-all (one-vs-rest):
+[figures: each class in turn treated as positive against the rest,
+each giving its own decision boundary]
+Class 1, Class 2, Class 3: train h_θ(i)(x) = p(y = i | x; θ) for each.
+
+One-vs-all
+Train a logistic regression classifier h_θ(i)(x) for each class i to
+predict the probability that y = i.
+On a new input x, to make a prediction, pick the class i that
+maximizes h_θ(i)(x).
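+A two-line Octave sketch of the one-vs-all prediction rule above
+(all_theta = K×(n+1) matrix whose i-th row is θ(i); names assumed):
+
+probs = 1 ./ (1 + exp(-(X * all_theta')));  % m x K matrix of h_theta(i)(x)
+[~, predictions] = max(probs, [], 2);       % pick the most probable class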
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture7.txt b/Lectures/mlclass/Lecture7.txt
new file mode 100644
index 0000000..729a299
--- /dev/null
+++ b/Lectures/mlclass/Lecture7.txt
@@ -0,0 +1,184 @@
+Regularization
+The problem of overfitting
+Machine Learning
+
+Example: Linear regression (housing prices)
+[figures: price vs. size fit with an underfit line, a reasonable
+quadratic, and an overfit high-order polynomial]
+Overfitting: If we have too many features, the learned hypothesis may
+fit the training set very well (J(θ) ≈ 0), but fail to generalize to
+new examples (predict prices on new examples).
+
+Example: Logistic regression
+[figures: underfit, good, and overfit decision boundaries in (x1, x2)]
+(g = sigmoid function)
+
+Addressing overfitting:
+E.g. many housing features - size of house, no. of bedrooms, no. of
+floors, age of house, average income in neighborhood, kitchen size -
+with few training examples. [figure: price vs. size]
+
+Addressing overfitting:
+Options:
+1. Reduce number of features.
+   - Manually select which features to keep.
+   - Model selection algorithm (later in course).
+2. Regularization.
+   - Keep all the features, but reduce magnitude/values of
+     parameters θ_j.
+   - Works well when we have a lot of features, each of which
+     contributes a bit to predicting y.
+
+Regularization
+Cost function
+Machine Learning
+
+Intuition
+[figures: quadratic vs. high-order polynomial fits of price vs. size
+of house]
+Suppose we penalize and make θ3, θ4 really small: the high-order fit
+collapses back toward the quadratic.
+
+Regularization.
+Small values for parameters θ0, θ1, ..., θn
+- "Simpler" hypothesis
+- Less prone to overfitting
+Housing:
+- Features: x1, x2, ..., x100
+- Parameters: θ0, θ1, ..., θ100
+J(θ) = (1/(2m)) [ sum_{i=1..m} (h_θ(x(i)) − y(i))²
+       + λ sum_{j=1..n} θ_j² ]
+(By convention the regularization term does not penalize θ0.)
+
+Regularization.
+[figure: the regularized fit of price vs. size of house is smoother]
+
+In regularized linear regression, we choose θ to minimize
+J(θ) = (1/(2m)) [ sum_{i=1..m} (h_θ(x(i)) − y(i))²
+       + λ sum_{j=1..n} θ_j² ]
+What if λ is set to an extremely large value (perhaps too large for
+our problem)? Then θ1, ..., θn ≈ 0 and h_θ(x) ≈ θ0: an underfit,
+flat hypothesis. [figure: price vs. size of house with a flat fit]
+
+Regularization
+Regularized linear regression
+Machine Learning
+
+Regularized linear regression
+J(θ) = (1/(2m)) [ sum_{i=1..m} (h_θ(x(i)) − y(i))²
+       + λ sum_{j=1..n} θ_j² ]
+
+Gradient descent
+Repeat {
+  θ0 := θ0 − α (1/m) sum_i (h_θ(x(i)) − y(i)) x0(i)
+  θ_j := θ_j − α [ (1/m) sum_i (h_θ(x(i)) − y(i)) x_j(i) + (λ/m) θ_j ]
+      (j = 1, 2, ..., n)
+}
+Equivalently, θ_j := θ_j (1 − α λ/m)
+              − α (1/m) sum_i (h_θ(x(i)) − y(i)) x_j(i).
+
+Normal equation
+θ = (XᵀX + λ·L)⁻¹ Xᵀy, where L is the (n+1)×(n+1) identity matrix
+with its (1,1) entry set to 0.
+
+Non-invertibility (optional/advanced).
+Suppose m ≤ n (#examples ≤ #features): XᵀX may be
+non-invertible/singular.
+If λ > 0, XᵀX + λ·L is invertible.
+
+Regularization
+Regularized logistic regression
+Machine Learning
+
+Regularized logistic regression.
+[figure: an overfit decision boundary in (x1, x2)]
+Cost function:
+J(θ) = −(1/m) sum_{i=1..m} [ y(i) log h_θ(x(i))
+       + (1 − y(i)) log(1 − h_θ(x(i))) ]
+       + (λ/(2m)) sum_{j=1..n} θ_j²
+
+Gradient descent
+Repeat {
+  θ0 := θ0 − α (1/m) sum_i (h_θ(x(i)) − y(i)) x0(i)
+  θ_j := θ_j − α [ (1/m) sum_i (h_θ(x(i)) − y(i)) x_j(i) + (λ/m) θ_j ]
+      (j = 1, 2, ..., n)
+}
+
+Advanced optimization
+function [jVal, gradient] = costFunction(theta)
+  jVal = [code to compute J(θ)];
+  gradient(1) = [code to compute ∂J(θ)/∂θ0];
+  gradient(2) = [code to compute ∂J(θ)/∂θ1];
+  gradient(3) = [code to compute ∂J(θ)/∂θ2];
+  ...
+  gradient(n+1) = [code to compute ∂J(θ)/∂θn];
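+The regularized normal equation above as a small Octave sketch
+(X, y, lambda assumed already in scope):
+
+n = size(X, 2) - 1;                      % number of features (X includes x0)
+L = eye(n + 1); L(1, 1) = 0;             % don't penalize theta_0
+theta = pinv(X' * X + lambda * L) * X' * y;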
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture8.txt b/Lectures/mlclass/Lecture8.txt
new file mode 100644
index 0000000..3f7f488
--- /dev/null
+++ b/Lectures/mlclass/Lecture8.txt
@@ -0,0 +1,388 @@
+Neural Networks: Representation
+Non-linear hypotheses
+Machine Learning
+
+Non-linear Classification
+[figure: classes in (x1, x2) requiring a complicated boundary]
+E.g. housing features x1 = size, x2 = # bedrooms, x3 = # floors,
+x4 = age, ...: with many features, including all quadratic or cubic
+terms blows up the number of features.
+
+What is this?
+You see this: [a picture of a car]
+But the camera sees this: [a matrix of pixel intensity values]
+
+Computer Vision: Car detection
+Cars / Not a car
+Testing: What is this? [a patch of pixels]
+
+[figures: plotting car and "non"-car images by two pixel intensities
+(pixel 1, pixel 2) and feeding raw images to a learning algorithm]
+50 x 50 pixel images → 2500 pixels (7500 if RGB)
+x = [pixel 1 intensity; pixel 2 intensity; ...; pixel 2500 intensity]
+Quadratic features (x_i × x_j): ≈ 3 million features
+
+Neural Networks: Representation
+Neurons and the brain
+Machine Learning
+
+Neural Networks
+Origins: Algorithms that try to mimic the brain.
+Was very widely used in 80s and early 90s; popularity diminished in
+late 90s.
+Recent resurgence: State-of-the-art technique for many applications.
+
+The "one learning algorithm" hypothesis
+Auditory cortex learns to see (neural rewiring experiments)
+[Roe et al., 1992]
+Somatosensory cortex learns to see
+[Metin & Frost, 1989]
+
+Sensor representations in the brain
+Seeing with your tongue; human echolocation (sonar); haptic belt:
+direction sense; implanting a 3rd eye.
+[BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005;
+Constantine-Paton & Law, 2009]
+
+Neural Networks: Representation
+Model representation I
+Machine Learning
+
+Neuron in the brain
+[figure: a single neuron - dendrites, cell body, axon]
+
+Neurons in the brain
+[figure: networks of neurons communicating via spikes]
+[Credit: US National Institutes of Health, National Institute on Aging]
+
+Neuron model: Logistic unit
+x = [x0; x1; x2; x3] (x0 = 1, "bias unit"), θ = weights.
+h_θ(x) = 1 / (1 + e^(−θᵀx))
+Sigmoid (logistic) activation function.
+
+Neural Network
+[figure: input layer (Layer 1), hidden layer (Layer 2), output layer
+(Layer 3)]
+
+Neural Network
+a_i^(j) = "activation" of unit i in layer j
+Θ^(j) = matrix of weights controlling function mapping from layer j
+to layer j+1
+If network has s_j units in layer j, s_{j+1} units in layer j+1, then
+Θ^(j) will be of dimension s_{j+1} × (s_j + 1).
+
+Neural Networks: Representation
+Model representation II
+Machine Learning
+
+Forward propagation: Vectorized implementation
+z^(2) = Θ^(1) x;  a^(2) = g(z^(2))
+Add a0^(2) = 1.
+z^(3) = Θ^(2) a^(2);  h_Θ(x) = a^(3) = g(z^(3))
+
+Neural Network learning its own features
+[figure: the hidden layer's activations (Layer 2) act as new features
+fed to the output unit (Layer 3)]
+
+Other network architectures
+[figure: a network with two hidden layers (Layers 2 and 3) before the
+output layer (Layer 4)]
+
+Neural Networks: Representation
+Examples and intuitions I
+Machine Learning
+
+Non-linear classification example: XOR/XNOR
+x1, x2 are binary (0 or 1).
+[figures: the XOR/XNOR pattern in the (x1, x2) plane]
+
+Simple example: AND
+h_Θ(x) = g(−30 + 20·x1 + 20·x2)
+x1  x2  h_Θ(x)
+0   0   g(−30) ≈ 0
+0   1   g(−10) ≈ 0
+1   0   g(−10) ≈ 0
+1   1   g(10) ≈ 1
+
+Example: OR function
+h_Θ(x) = g(−10 + 20·x1 + 20·x2)
+x1  x2  h_Θ(x)
+0   0   g(−10) ≈ 0
+0   1   g(10) ≈ 1
+1   0   g(10) ≈ 1
+1   1   g(30) ≈ 1
+
+Neural Networks: Representation
+Examples and intuitions II
+Machine Learning
+
+Negation: a unit with a large negative weight on x1 outputs NOT x1
+(output ≈ 1 for x1 = 0, ≈ 0 for x1 = 1).
+
+Putting it together: x1 XNOR x2
+First hidden unit (AND):                      weights −30, 20, 20
+Second hidden unit ((NOT x1) AND (NOT x2)):   weights 10, −20, −20
+Output unit (OR):                             weights −10, 20, 20
+x1  x2  a1^(2)  a2^(2)  h_Θ(x)
+0   0   0       1       1
+0   1   0       0       0
+1   0   0       0       0
+1   1   1       0       1
+
+Neural Network intuition
+[figure: deeper layers (Layer 1 through Layer 4) compute
+progressively more complex functions of the input]
+
+Handwritten digit classification
+[figure: digit-recognition demo]
+[Courtesy of Yann LeCun]
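+The XNOR network above, checked in a few lines of Octave (the weights
+are taken from the slide):
+
+g = @(z) 1 ./ (1 + exp(-z));             % sigmoid
+Theta1 = [-30 20 20; 10 -20 -20];        % AND; (NOT x1) AND (NOT x2)
+Theta2 = [-10 20 20];                    % OR of the two hidden units
+for x = [0 0; 0 1; 1 0; 1 1]'
+    a2 = g(Theta1 * [1; x]);             % hidden layer (bias added)
+    h  = g(Theta2 * [1; a2]);            % output = x1 XNOR x2
+    printf('%d %d -> %.3f\n', x(1), x(2), h);
+end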
+Neural Networks: Representation
+Multi-class classification
+Machine Learning
+
+Multiple output units: One-vs-all.
+[figure: a network with four output units - pedestrian, car,
+motorcycle, truck]
+Want h_Θ(x) ≈ [1;0;0;0] when pedestrian, h_Θ(x) ≈ [0;1;0;0] when car,
+h_Θ(x) ≈ [0;0;1;0] when motorcycle, etc.
+Training set: (x(1), y(1)), ..., (x(m), y(m));
+y(i) one of [1;0;0;0], [0;1;0;0], [0;0;1;0], [0;0;0;1]
+(pedestrian, car, motorcycle, truck).
\ No newline at end of file
diff --git a/Lectures/mlclass/Lecture9.txt b/Lectures/mlclass/Lecture9.txt
new file mode 100644
index 0000000..b7552a6
--- /dev/null
+++ b/Lectures/mlclass/Lecture9.txt
@@ -0,0 +1,313 @@
+Neural Networks: Learning
+Cost function
+Machine Learning
+
+Neural Network (Classification)
+[figure: a 4-layer network (Layer 1, Layer 2, Layer 3, Layer 4)]
+L = total no. of layers in network
+s_l = no. of units (not counting bias unit) in layer l
+Binary classification: y ∈ {0, 1}; 1 output unit.
+Multi-class classification (K classes): y ∈ R^K,
+e.g. [1;0;0;0], [0;1;0;0], [0;0;1;0], [0;0;0;1] for pedestrian, car,
+motorcycle, truck; K output units.
+
+Cost function
+Logistic regression:
+J(θ) = −(1/m) sum_{i=1..m} [ y(i) log h_θ(x(i))
+       + (1 − y(i)) log(1 − h_θ(x(i))) ] + (λ/(2m)) sum_{j=1..n} θ_j²
+Neural network: h_Θ(x) ∈ R^K, (h_Θ(x))_k = k-th output.
+J(Θ) = −(1/m) sum_{i=1..m} sum_{k=1..K} [ y_k(i) log (h_Θ(x(i)))_k
+       + (1 − y_k(i)) log(1 − (h_Θ(x(i)))_k) ]
+       + (λ/(2m)) sum over all layers l and all non-bias weights of
+       (Θ_ji^(l))²
+
+Neural Networks: Learning
+Backpropagation algorithm
+Machine Learning
+
+Gradient computation
+Need code to compute:
+- J(Θ)
+- ∂J(Θ)/∂Θ_ij^(l)
+
+Gradient computation
+Given one training example (x, y):
+Forward propagation (4-layer network, Layers 1-4):
+a^(1) = x
+z^(2) = Θ^(1) a^(1);  a^(2) = g(z^(2))  (add a0^(2))
+z^(3) = Θ^(2) a^(2);  a^(3) = g(z^(3))  (add a0^(3))
+z^(4) = Θ^(3) a^(3);  a^(4) = h_Θ(x) = g(z^(4))
+
+Gradient computation: Backpropagation algorithm
+Intuition: δ_j^(l) = "error" of node j in layer l.
+For each output unit (layer L = 4): δ^(4) = a^(4) − y
+δ^(3) = (Θ^(3))ᵀ δ^(4) .* g'(z^(3))
+δ^(2) = (Θ^(2))ᵀ δ^(3) .* g'(z^(2))
+(no δ^(1): the input carries no error), with
+g'(z^(l)) = a^(l) .* (1 − a^(l)).
+
+Backpropagation algorithm
+Training set {(x(1), y(1)), ..., (x(m), y(m))}
+Set Δ_ij^(l) = 0 (for all l, i, j).
+For i = 1 to m:
+  Set a^(1) = x(i)
+  Perform forward propagation to compute a^(l) for l = 2, 3, ..., L
+  Using y(i), compute δ^(L) = a^(L) − y(i)
+  Compute δ^(L−1), δ^(L−2), ..., δ^(2)
+  Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)
+D_ij^(l) := (1/m) Δ_ij^(l) + λ Θ_ij^(l)   if j ≠ 0
+D_ij^(l) := (1/m) Δ_ij^(l)                if j = 0
+∂J(Θ)/∂Θ_ij^(l) = D_ij^(l)
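+A minimal Octave sketch of the per-example backpropagation step
+above, specialized to a 3-layer network for brevity (Theta1, Theta2,
+the accumulators Delta1, Delta2, and the example x, y are assumed in
+scope; g'(z) is written as a .* (1 − a)):
+
+a1 = [1; x];                             % input with bias unit
+a2 = [1; 1 ./ (1 + exp(-Theta1 * a1))];  % hidden activations, bias added
+a3 = 1 ./ (1 + exp(-Theta2 * a2));       % output h_Theta(x)
+d3 = a3 - y;                             % output-layer "error"
+d2 = (Theta2' * d3) .* (a2 .* (1 - a2)); % back-propagate one layer
+d2 = d2(2:end);                          % drop the bias-unit error
+Delta1 = Delta1 + d2 * a1';              % accumulate gradient terms
+Delta2 = Delta2 + d3 * a2';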
+Neural Networks: Learning
+Backpropagation intuition
+Machine Learning
+
+Forward Propagation
+[figures: a small network, stepping through the z and a computations]
+
+What is backpropagation doing?
+Focusing on a single example x(i), y(i), the case of 1 output unit,
+and ignoring regularization (λ = 0):
+cost(i) = y(i) log h_Θ(x(i)) + (1 − y(i)) log(1 − h_Θ(x(i)))
+(Think of cost(i) ≈ (h_Θ(x(i)) − y(i))².)
+I.e. how well is the network doing on example i?
+
+Forward Propagation
+δ_j^(l) = "error" of cost for a_j^(l) (unit j in layer l).
+Formally, δ_j^(l) = ∂cost(i)/∂z_j^(l) (for j ≥ 0), where
+cost(i) = y(i) log h_Θ(x(i)) + (1 − y(i)) log(1 − h_Θ(x(i))).
+
+Neural Networks: Learning
+Implementation note: Unrolling parameters
+Machine Learning
+
+Advanced optimization
+function [jVal, gradient] = costFunction(theta)
+...
+optTheta = fminunc(@costFunction, initialTheta, options)
+Neural Network (L = 4):
+Θ^(1), Θ^(2), Θ^(3) - matrices (Theta1, Theta2, Theta3)
+D^(1), D^(2), D^(3) - matrices (D1, D2, D3)
+"Unroll" into vectors.
+
+Example
+(Here Θ^(1), Θ^(2) are 10×11 and Θ^(3) is 1×11.)
+thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
+DVec = [D1(:); D2(:); D3(:)];
+Theta1 = reshape(thetaVec(1:110), 10, 11);
+Theta2 = reshape(thetaVec(111:220), 10, 11);
+Theta3 = reshape(thetaVec(221:231), 1, 11);
+
+Learning Algorithm
+Have initial parameters Θ^(1), Θ^(2), Θ^(3).
+Unroll to get initialTheta to pass to
+fminunc(@costFunction, initialTheta, options)
+function [jval, gradientVec] = costFunction(thetaVec)
+  From thetaVec, get Θ^(1), Θ^(2), Θ^(3) (reshape).
+  Use forward prop/back prop to compute D^(1), D^(2), D^(3) and J(Θ).
+  Unroll D^(1), D^(2), D^(3) to get gradientVec.
+
+Neural Networks: Learning
+Gradient checking
+Machine Learning
+
+Numerical estimation of gradients
+dJ(θ)/dθ ≈ (J(θ + ε) − J(θ − ε)) / (2ε)   (two-sided difference)
+Implement:
+gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON)
+
+Parameter vector θ
+(E.g. θ is the "unrolled" version of Θ^(1), Θ^(2), Θ^(3).)
+∂J/∂θ_i ≈ (J(θ1, ..., θ_i + ε, ..., θ_n)
+          − J(θ1, ..., θ_i − ε, ..., θ_n)) / (2ε)
+
+for i = 1:n,
+  thetaPlus = theta;
+  thetaPlus(i) = thetaPlus(i) + EPSILON;
+  thetaMinus = theta;
+  thetaMinus(i) = thetaMinus(i) - EPSILON;
+  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
+end;
+Check that gradApprox ≈ DVec
+
+Implementation Note:
+- Implement backprop to compute DVec (unrolled D^(1), D^(2), D^(3)).
+- Implement numerical gradient check to compute gradApprox.
+- Make sure they give similar values.
+- Turn off gradient checking. Use backprop code for learning.
+Important:
+- Be sure to disable your gradient checking code before training your
+  classifier. If you run numerical gradient computation on every
+  iteration of gradient descent (or in the inner loop of
+  costFunction(...)) your code will be very slow.
+
+Neural Networks: Learning
+Random initialization
+Machine Learning
+
+Initial value of Θ
+For gradient descent and advanced optimization method, need initial
+value for Θ.
+optTheta = fminunc(@costFunction, initialTheta, options)
+Consider gradient descent: set initialTheta = zeros(n,1)?
+
+Zero initialization
+[figure: a network in which symmetric weights stay symmetric]
+After each update, parameters corresponding to inputs going into each
+of two hidden units are identical.
+
+Random initialization: Symmetry breaking
+Initialize each Θ_ij^(l) to a random value in [−ε, ε]
+(i.e. −ε ≤ Θ_ij^(l) ≤ ε).
+E.g.
+Theta1 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;
+Theta2 = rand(1,11) * (2*INIT_EPSILON) - INIT_EPSILON;
+
+Neural Networks: Learning
+Putting it together
+Machine Learning
+
+Training a neural network
+Pick a network architecture (connectivity pattern between neurons).
+[figures: networks with one, two, and three hidden layers]
+No. of input units: Dimension of features x(i)
+No. of output units: Number of classes
+Reasonable default: 1 hidden layer, or if > 1 hidden layer, have same
+no. of hidden units in every layer (usually the more the better).
+
+Training a neural network
+1. Randomly initialize weights
+2. Implement forward propagation to get h_Θ(x(i)) for any x(i)
+3. Implement code to compute cost function J(Θ)
+4. Implement backprop to compute partial derivatives ∂J(Θ)/∂Θ_jk^(l)
+   for i = 1:m
+     Perform forward propagation and backpropagation using example
+     (x(i), y(i))
+     (Get activations a^(l) and delta terms δ^(l) for l = 2, ..., L).
+
+Training a neural network
+5. Use gradient checking to compare ∂J(Θ)/∂Θ_jk^(l) computed using
+   backpropagation vs. using numerical estimate of gradient of J(Θ).
+   Then disable gradient checking code.
+6. Use gradient descent or advanced optimization method with
+   backpropagation to try to minimize J(Θ) as a function of
+   parameters Θ.
+[figure: contour plot of J(Θ)]
+
+Neural Networks: Learning
+Backpropagation example: Autonomous driving (optional)
+Machine Learning
+
+[figures: frames from the autonomous driving demo]
+[Courtesy of Dean Pomerleau]
\ No newline at end of file
diff --git a/Linear Regression/mlclass-ex1/octave-core b/Linear Regression/mlclass-ex1/octave-core
deleted file mode 100644
index 184ecdd..0000000
Binary files a/Linear Regression/mlclass-ex1/octave-core and /dev/null differ
diff --git a/Linear Regression/ex1.pdf b/LinearRegression/ex1.pdf
similarity index 100%
rename from Linear Regression/ex1.pdf
rename to LinearRegression/ex1.pdf
diff --git a/LinearRegression/ex1.txt b/LinearRegression/ex1.txt
new file mode 100644
index 0000000..3c0c7a5
--- /dev/null
+++ b/LinearRegression/ex1.txt
@@ -0,0 +1,758 @@
+Programming Exercise 1: Linear Regression
+Machine Learning
+
+Introduction
+In this exercise, you will implement linear regression and get to see
+it work on data. Before starting on this programming exercise, we
+strongly recommend watching the video lectures and completing the
+review questions for the associated topics.
+To get started with the exercise, you will need to download the
+starter code and unzip its contents to the directory where you wish
+to complete the exercise. If needed, use the cd command in Octave to
+change to this directory before starting this exercise.
+You can also find instructions for installing Octave on the "Octave
+Installation" page on the course website.
+Files included in this exercise
+ex1.m - Octave script that will help step you through the exercise
+ex1_multi.m - Octave script for the later parts of the exercise
+ex1data1.txt - Dataset for linear regression with one variable
+ex1data2.txt - Dataset for linear regression with multiple variables
+submit.m - Submission script that sends your solutions to our servers
+[*] warmUpExercise.m - Simple example function in Octave
+[*] plotData.m - Function to display the dataset
+[*] computeCost.m - Function to compute the cost of linear regression
+[*] gradientDescent.m - Function to run gradient descent
+[†] computeCostMulti.m - Cost function for multiple variables
+[†] gradientDescentMulti.m - Gradient descent for multiple variables
+[†] featureNormalize.m - Function to normalize features
+[†] normalEqn.m - Function to compute the normal equations
+* indicates files you will need to complete
+† indicates extra credit exercises
+
+Throughout the exercise, you will be using the scripts ex1.m and
+ex1_multi.m. These scripts set up the dataset for the problems and
+make calls to functions that you will write. You do not need to
+modify either of them. You are only required to modify functions in
+other files, by following the instructions in this assignment.
+For this programming exercise, you are only required to complete the
+first part of the exercise to implement linear regression with one
+variable. The second part of the exercise, which you may complete for
+extra credit, covers linear regression with multiple variables.
+
+Where to get help
+The exercises in this course use Octave,[1] a high-level programming
+language well-suited for numerical computations. If you do not have
+Octave installed, please refer to the installation instructions at
+the "Octave Installation" page
+http://www.ml-class.org/course/resources/index?page=octave-install
+on the course website.
+At the Octave command line, typing help followed by a function name
+displays documentation for a built-in function. For example, help
+plot will bring up help information for plotting. Further
+documentation for Octave functions can be found at the Octave
+documentation pages.
+We also strongly encourage using the online Q&A Forum to discuss
+exercises with other students. However, do not look at any source
+code written by others or share your source code with others.
+[1] Octave is a free alternative to MATLAB. For the programming
+exercises, you are free to use either Octave or MATLAB.
+
+1 Simple octave function
+The first part of ex1.m gives you practice with Octave syntax and the
+homework submission process. In the file warmUpExercise.m, you will
+find the outline of an Octave function. Modify it to return a 5 x 5
+identity matrix by filling in the following code:
+A = eye(5);
+When you are finished, run ex1.m (assuming you are in the correct
+directory, type "ex1" at the Octave prompt) and you should see output
+similar to the following:
+ans =
+Diagonal Matrix
+   1   0   0   0   0
+   0   1   0   0   0
+   0   0   1   0   0
+   0   0   0   1   0
+   0   0   0   0   1
+Now ex1.m will pause until you press any key, and then will run the
+code for the next part of the assignment. If you wish to quit, typing
+ctrl-c will stop the program in the middle of its run.
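+One way to fill in warmUpExercise.m, consistent with the instructions
+above (a sketch, not the official starter code):
+
+function A = warmUpExercise()
+%WARMUPEXERCISE Example function in Octave
+A = eye(5);    % return a 5 x 5 identity matrix
+end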
The submission script will prompt you for your username and password and ask you which files you want to submit. You can obtain a submission password from the website's "Programming Exercises" page.
You should now submit the warm up exercise.
You are allowed to submit your solutions multiple times, and we will take only the highest score into consideration. To prevent rapid-fire guessing, the system enforces a minimum of 5 minutes between submissions.
All parts of this programming exercise are due Sunday, October 23rd at 23:59:59 PDT.

2 Linear regression with one variable
In this part of this exercise, you will implement linear regression with one variable to predict profits for a food truck. Suppose you are the CEO of a

3

restaurant franchise and are considering different cities for opening a new outlet. The chain already has trucks in various cities and you have data for profits and populations from the cities.
You would like to use this data to help you select which city to expand to next.
The file ex1data1.txt contains the dataset for our linear regression problem. The first column is the population of a city and the second column is the profit of a food truck in that city. A negative value for profit indicates a loss.
The ex1.m script has already been set up to load this data for you.

2.1 Plotting the Data
Before starting on any task, it is often useful to understand the data by visualizing it. For this dataset, you can use a scatter plot to visualize the data, since it has only two properties to plot (profit and population). (Many other problems that you will encounter in real life are multi-dimensional and can't be plotted on a 2-d plot.)
In ex1.m, the dataset is loaded from the data file into the variables X and y:

data = csvread('ex1data1.txt');   % read comma separated data
X = data(:, 1); y = data(:, 2);
m = length(y);                    % number of training examples

Next, the script calls the plotData function to create a scatter plot of the data. Your job is to complete plotData.m to draw the plot; modify the file and fill in the following code:

plot(x, y, 'rx', 'MarkerSize', 10);        % Plot the data
ylabel('Profit in $10,000s');              % Set the y-axis label
xlabel('Population of City in 10,000s');   % Set the x-axis label

Now, when you continue to run ex1.m, your end result should look like Figure 1, with the same red "x" markers and axis labels.
To learn more about the plot command, you can type help plot at the Octave command prompt or search online for plotting documentation. (To change the markers to red "x", we used the option 'rx' together with the plot command, i.e., plot(..,[your options here],.., 'rx'); )

4

Figure 1: Scatter plot of training data

2.2 Gradient Descent
In this part, you will fit the linear regression parameters θ to our dataset using gradient descent.

2.2.1 Update Equations
The objective of linear regression is to minimize the cost function

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

where the hypothesis hθ(x) is given by the linear model

h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1

Recall that the parameters of your model are the θj values. These are the values you will adjust to minimize the cost J(θ). One way to do this is to use the batch gradient descent algorithm.
In batch gradient descent, each iteration performs the update

5

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{(simultaneously update } \theta_j \text{ for all } j\text{)}.

With each step of gradient descent, your parameters θj come closer to the optimal values that will achieve the lowest cost J(θ).

Implementation Note: We store each example as a row in the X matrix in Octave. To take into account the intercept term (θ0), we add an additional first column to X and set it to all ones. This allows us to treat θ0 as simply another 'feature'.

2.2.2 Implementation
In ex1.m, we have already set up the data for linear regression. In the following lines, we add another dimension to our data to accommodate the θ0 intercept term. We also initialize the initial parameters to 0 and the learning rate alpha to 0.01.

X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1);         % initialize fitting parameters
iterations = 1500;
alpha = 0.01;

2.2.3 Computing the cost J(θ)
As you perform gradient descent to minimize the cost function J(θ), it is helpful to monitor the convergence by computing the cost. In this section, you will implement a function to calculate J(θ) so you can check the convergence of your gradient descent implementation.
Your next task is to complete the code in the file computeCost.m, which is a function that computes J(θ). As you are doing this, remember that the variables X and y are not scalar values, but matrices whose rows represent the examples from the training set.
Once you have completed the function, the next step in ex1.m will run computeCost once using θ initialized to zeros, and you will see the cost printed to the screen.
You should expect to see a cost of 32.07.

6

You should now submit "compute cost" for linear regression with one variable.

2.2.4 Gradient descent
Next, you will implement gradient descent in the file gradientDescent.m. The loop structure has been written for you, and you only need to supply the updates to θ within each iteration.
As you program, make sure you understand what you are trying to optimize and what is being updated. Keep in mind that the cost J(θ) is parameterized by the vector θ, not X and y. That is, we minimize the value of J(θ) by changing the values of the vector θ, not by changing X or y. Refer to the equations in this handout and to the video lectures if you are uncertain.
A good way to verify that gradient descent is working correctly is to look at the value of J(θ) and check that it is decreasing with each step. The starter code for gradientDescent.m calls computeCost on every iteration and prints the cost. Assuming you have implemented gradient descent and computeCost correctly, your value of J(θ) should never increase, and should converge to a steady value by the end of the algorithm.
After you are finished, ex1.m will use your final parameters to plot the linear fit. The result should look something like Figure 2:
Your final values for θ will also be used to make predictions on profits in areas of 35,000 and 70,000 people. Note the way that the following lines in ex1.m use matrix multiplication, rather than explicit summation or looping, to calculate the predictions. This is an example of code vectorization in Octave.
You should now submit gradient descent for linear regression with one variable.
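If you would like a reference point while debugging, the following is one possible vectorized shape for these two pieces. It is a sketch consistent with the equations above, not the official solution; variable names follow the starter code.

% A minimal, vectorized computeCost.m (a sketch, not the official solution):
function J = computeCost(X, y, theta)
  % J = 1/(2m) * sum of squared prediction errors, computed without loops
  m = length(y);           % number of training examples
  err = X * theta - y;     % m x 1 vector of prediction errors
  J = (err' * err) / (2 * m);
end

% ...and the corresponding theta update inside the gradientDescent.m loop:
theta = theta - (alpha / m) * (X' * (X * theta - y));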
These are the two prediction lines from ex1.m referred to above:

predict1 = [1, 3.5] * theta;
predict2 = [1, 7] * theta;

2.3 Debugging
Here are some things to keep in mind as you implement gradient descent:
• Octave array indices start from one, not zero. If you're storing θ0 and θ1 in a vector called theta, the values will be theta(1) and theta(2).

7

Figure 2: Training data with linear regression fit

• If you are seeing many errors at runtime, inspect your matrix operations to make sure that you're adding and multiplying matrices of compatible dimensions. Printing the dimensions of variables with the size command will help you debug.
• By default, Octave interprets math operators to be matrix operators. This is a common source of size incompatibility errors. If you don't want matrix multiplication, you need to add the "dot" notation to specify this to Octave. For example, A*B does a matrix multiply, while A.*B does an element-wise multiplication.

2.4 Visualizing J(θ)
To understand the cost function J(θ) better, you will now plot the cost over a 2-dimensional grid of θ0 and θ1 values. You will not need to code anything new for this part, but you should understand how the code you have written already is creating these images.
In the next step of ex1.m, there is code set up to calculate J(θ) over a grid of values using the computeCost function that you wrote.

% initialize J_vals to a matrix of 0's

8

J_vals = zeros(length(theta0_vals), length(theta1_vals));
% Fill out J_vals
for i = 1:length(theta0_vals)
  for j = 1:length(theta1_vals)
    t = [theta0_vals(i); theta1_vals(j)];
    J_vals(i,j) = computeCost(x, y, t);
  end
end

After these lines are executed, you will have a 2-D array of J(θ) values. The script ex1.m will then use these values to produce surface and contour plots of J(θ) using the surf and contour commands. The plots should look something like Figure 3:

Figure 3: Cost function J(θ) — (a) Surface, (b) Contour, showing minimum

The purpose of these graphs is to show you how J(θ) varies with changes in θ0 and θ1. The cost function J(θ) is bowl-shaped and has a global minimum. (This is easier to see in the contour plot than in the 3D surface plot). This minimum is the optimal point for θ0 and θ1, and each step of gradient descent moves closer to this point.

9

Extra Credit Exercises (optional)
If you have successfully completed the material above, congratulations! You now understand linear regression and should be able to start using it on your own datasets.
For the rest of this programming exercise, we have included the following optional extra credit exercises. These exercises will help you gain a deeper understanding of the material, and if you are able to do so, we encourage you to complete them as well.

3 Linear regression with multiple variables
In this part, you will implement linear regression with multiple variables to predict the prices of houses. Suppose you are selling your house and you want to know what a good market price would be.
One way to do this is to first collect information on recent houses sold and make a model of housing prices.
The file ex1data2.txt contains a training set of housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house.
The ex1_multi.m script has been set up to help you step through this exercise.

3.1 Feature Normalization
The ex1_multi.m script will start by loading and displaying some values from this dataset. By looking at the values, note that house sizes are about 1000 times the number of bedrooms. When features differ by orders of magnitude, first performing feature scaling can make gradient descent converge much more quickly.
Your task here is to complete the code in featureNormalize.m to
• Subtract the mean value of each feature from the dataset.
• After subtracting the mean, additionally scale (divide) the feature values by their respective "standard deviations."

10

The standard deviation is a way of measuring how much variation there is in the range of values of a particular feature (most data points will lie within ±2 standard deviations of the mean); this is an alternative to taking the range of values (max-min). In Octave, you can use the "std" function to compute the standard deviation. For example, inside featureNormalize.m, the quantity X(:,1) contains all the values of x1 (house sizes) in the training set, so std(X(:,1)) computes the standard deviation of the house sizes.
At the time that featureNormalize.m is called, the extra column of 1's corresponding to x0 = 1 has not yet been added to X (see ex1_multi.m for details).
You will do this for all the features and your code should work with datasets of all sizes (any number of features / examples). Note that each column of the matrix X corresponds to one feature.
You should now submit feature normalization.

Implementation Note: When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.

3.2 Gradient Descent
Previously, you implemented gradient descent on a univariate regression problem. The only difference now is that there is one more feature in the matrix X. The hypothesis function and the batch gradient descent update rule remain unchanged.
You should complete the code in computeCostMulti.m and gradientDescentMulti.m to implement the cost function and gradient descent for linear regression with multiple variables. If your code in the previous part (single variable) already supports multiple variables, you can use it here too.
Make sure your code supports any number of features and is well-vectorized. You can use 'size(X, 2)' to find out how many features are present in the dataset.
You should now submit compute cost and gradient descent for linear regression with multiple variables.

11

Implementation Note: In the multivariate case, the cost function can also be written in the following vectorized form:

J(\theta) = \frac{1}{2m} (X\theta - \vec{y})^T (X\theta - \vec{y})

where

X = \begin{bmatrix} - (x^{(1)})^T - \\ - (x^{(2)})^T - \\ \vdots \\ - (x^{(m)})^T - \end{bmatrix} \qquad \vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}.

The vectorized version is efficient when you're working with numerical computing tools like Octave. If you are an expert with matrix operations, you can prove to yourself that the two forms are equivalent.

3.2.1 Optional (ungraded) exercise: Selecting learning rates
In this part of the exercise, you will get to try out different learning rates for the dataset and find a learning rate that converges quickly. You can change the learning rate by modifying ex1_multi.m and changing the part of the code that sets the learning rate.
The next phase in ex1_multi.m will call your gradientDescent.m function and run gradient descent for about 50 iterations at the chosen learning rate. The function should also return the history of J(θ) values in a vector J. After the last iteration, the ex1_multi.m script plots the J values against the number of the iterations.
If you picked a learning rate within a good range, your plot should look similar to Figure 4. If your graph looks very different, especially if your value of J(θ) increases or even blows up, adjust your learning rate and try again. We recommend trying values of the learning rate α on a log-scale, at multiplicative steps of about 3 times the previous value (i.e., 0.3, 0.1, 0.03, 0.01 and so on). You may also want to adjust the number of iterations you are running if that will help you see the overall trend in the curve.

12

Figure 4: Convergence of gradient descent with an appropriate learning rate

Implementation Note: If your learning rate is too large, J(θ) can diverge and 'blow up', resulting in values which are too large for computer calculations. In these situations, Octave will tend to return NaNs. NaN stands for 'not a number' and is often caused by undefined operations that involve −∞ and +∞.

Octave Tip: To compare how different learning rates affect convergence, it's helpful to plot J for several learning rates on the same figure. In Octave, this can be done by performing gradient descent multiple times with a 'hold on' command between plots. Concretely, if you've tried three different values of alpha (you should probably try more values than this) and stored the costs in J1, J2 and J3, you can use the following commands to plot them on the same figure:

plot(1:50, J1(1:50), 'b');
hold on;
plot(1:50, J2(1:50), 'r');
plot(1:50, J3(1:50), 'k');

The final arguments 'b', 'r', and 'k' specify different colors for the plots.

13

Notice the changes in the convergence curves as the learning rate changes. With a small learning rate, you should find that gradient descent takes a very long time to converge to the optimal value. Conversely, with a large learning rate, gradient descent might not converge or might even diverge!
Using the best learning rate that you found, run the ex1_multi.m script to run gradient descent until convergence to find the final values of θ. Next, use this value of θ to predict the price of a house with 1650 square feet and 3 bedrooms. You will use this value later to check your implementation of the normal equations. Don't forget to normalize your features when you make this prediction!
You do not need to submit any solutions for these optional (ungraded) exercises.

3.3 Normal Equations
In the lecture videos, you learned that the closed-form solution to linear regression is

\theta = \left( X^T X \right)^{-1} X^T y.
Using this formula does not require any feature scaling, and you will get an exact solution in one calculation: there is no "loop until convergence" like in gradient descent.
Complete the code in normalEqn.m to use the formula above to calculate θ. Remember that while you don't need to scale your features, we still need to add a column of 1's to the X matrix to have an intercept term (θ0).
You should now submit the normal equations function.
Optional (ungraded) exercise: Now, once you have found θ using this method, use it to make a price prediction for a 1650-square-foot house with 3 bedrooms. You should find that it gives the same predicted price as the value you obtained using the model fit with gradient descent (in Section 3.2.1).

14

Submission and Grading
After completing various parts of the assignment, be sure to use the submit function to submit your solutions to our servers. The following is a breakdown of how each part of this exercise is scored.

Part                                       Submitted File            Points
Warm up exercise                           warmUpExercise.m          10 points
Compute cost for one variable              computeCost.m             40 points
Gradient descent for one variable          gradientDescent.m         50 points
Total Points                                                         100 points

Extra Credit Exercises (optional)
Feature normalization                      featureNormalize.m        10 points
Compute cost for multiple variables        computeCostMulti.m        15 points
Gradient descent for multiple variables    gradientDescentMulti.m    15 points
Normal Equations                           normalEqn.m               10 points

You are allowed to submit your solutions multiple times, and we will take the highest score into consideration. To prevent rapid-fire guessing, the system enforces a minimum of 5 minutes between submissions.
All parts of this programming exercise are due Sunday, October 23rd at 23:59:59 PDT.
+ +15 + + \ No newline at end of file diff --git a/Linear Regression/mlclass-ex1/computeCost.m b/LinearRegression/mlclass-ex1/computeCost.m similarity index 100% rename from Linear Regression/mlclass-ex1/computeCost.m rename to LinearRegression/mlclass-ex1/computeCost.m diff --git a/Linear Regression/mlclass-ex1/computeCostMulti.m b/LinearRegression/mlclass-ex1/computeCostMulti.m similarity index 100% rename from Linear Regression/mlclass-ex1/computeCostMulti.m rename to LinearRegression/mlclass-ex1/computeCostMulti.m diff --git a/Linear Regression/mlclass-ex1/ex1.m b/LinearRegression/mlclass-ex1/ex1.m similarity index 100% rename from Linear Regression/mlclass-ex1/ex1.m rename to LinearRegression/mlclass-ex1/ex1.m diff --git a/Linear Regression/mlclass-ex1/ex1_multi.m b/LinearRegression/mlclass-ex1/ex1_multi.m similarity index 100% rename from Linear Regression/mlclass-ex1/ex1_multi.m rename to LinearRegression/mlclass-ex1/ex1_multi.m diff --git a/Linear Regression/mlclass-ex1/ex1data1.txt b/LinearRegression/mlclass-ex1/ex1data1.txt similarity index 100% rename from Linear Regression/mlclass-ex1/ex1data1.txt rename to LinearRegression/mlclass-ex1/ex1data1.txt diff --git a/Linear Regression/mlclass-ex1/ex1data2.txt b/LinearRegression/mlclass-ex1/ex1data2.txt similarity index 100% rename from Linear Regression/mlclass-ex1/ex1data2.txt rename to LinearRegression/mlclass-ex1/ex1data2.txt diff --git a/Linear Regression/mlclass-ex1/featureNormalize.m b/LinearRegression/mlclass-ex1/featureNormalize.m similarity index 100% rename from Linear Regression/mlclass-ex1/featureNormalize.m rename to LinearRegression/mlclass-ex1/featureNormalize.m diff --git a/Linear Regression/mlclass-ex1/gradientDescent.m b/LinearRegression/mlclass-ex1/gradientDescent.m similarity index 100% rename from Linear Regression/mlclass-ex1/gradientDescent.m rename to LinearRegression/mlclass-ex1/gradientDescent.m diff --git a/Linear Regression/mlclass-ex1/gradientDescentMulti.m b/LinearRegression/mlclass-ex1/gradientDescentMulti.m similarity index 100% rename from Linear Regression/mlclass-ex1/gradientDescentMulti.m rename to LinearRegression/mlclass-ex1/gradientDescentMulti.m diff --git a/Linear Regression/mlclass-ex1/normalEqn.m b/LinearRegression/mlclass-ex1/normalEqn.m similarity index 100% rename from Linear Regression/mlclass-ex1/normalEqn.m rename to LinearRegression/mlclass-ex1/normalEqn.m diff --git a/Linear Regression/mlclass-ex1/plotData.m b/LinearRegression/mlclass-ex1/plotData.m similarity index 100% rename from Linear Regression/mlclass-ex1/plotData.m rename to LinearRegression/mlclass-ex1/plotData.m diff --git a/Linear Regression/mlclass-ex1/submit.m b/LinearRegression/mlclass-ex1/submit.m similarity index 100% rename from Linear Regression/mlclass-ex1/submit.m rename to LinearRegression/mlclass-ex1/submit.m diff --git a/Linear Regression/mlclass-ex1/warmUpExercise.m b/LinearRegression/mlclass-ex1/warmUpExercise.m similarity index 100% rename from Linear Regression/mlclass-ex1/warmUpExercise.m rename to LinearRegression/mlclass-ex1/warmUpExercise.m diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/octave-core b/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/octave-core deleted file mode 100644 index 1fc2a79..0000000 Binary files a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/octave-core and /dev/null differ diff --git a/Logistic Regression/mlclass-ex2/octave-core b/Logistic Regression/mlclass-ex2/octave-core deleted file mode 100644 index ea55231..0000000 Binary files a/Logistic 
Regression/mlclass-ex2/octave-core and /dev/null differ diff --git a/Logistic Regression/ex2.pdf b/LogisticRegression/ex2.pdf similarity index 100% rename from Logistic Regression/ex2.pdf rename to LogisticRegression/ex2.pdf diff --git a/LogisticRegression/ex2.txt b/LogisticRegression/ex2.txt new file mode 100644 index 0000000..f65e0ce --- /dev/null +++ b/LogisticRegression/ex2.txt @@ -0,0 +1,705 @@

Programming Exercise 2: Logistic Regression
Machine Learning
October 20, 2011

Introduction
In this exercise, you will implement logistic regression and apply it to two different datasets. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics.
To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, use the cd command in Octave to change to this directory before starting this exercise.
You can also find instructions for installing Octave on the "Octave Installation" page on the course website.

Files included in this exercise
ex2.m - Octave script that will help step you through the exercise
ex2_reg.m - Octave script for the later parts of the exercise
ex2data1.txt - Training set for the first half of the exercise
ex2data2.txt - Training set for the second half of the exercise
submit.m - Submission script that sends your solutions to our servers
mapFeature.m - Function to generate polynomial features
plotDecisionBoundary.m - Function to plot classifier's decision boundary
[ ] plotData.m - Function to plot 2D classification data
[ ] sigmoid.m - Sigmoid Function
[ ] costFunction.m - Logistic Regression Cost Function
[ ] predict.m - Logistic Regression Prediction Function
[ ] costFunctionReg.m - Regularized Logistic Regression Cost
indicates files you will need to complete

1

Throughout the exercise, you will be using the scripts ex2.m and ex2_reg.m. These scripts set up the dataset for the problems and make calls to functions that you will write. You do not need to modify either of them. You are only required to modify functions in other files, by following the instructions in this assignment.

Where to get help
The exercises in this course use Octave,1 a high-level programming language well-suited for numerical computations. If you do not have Octave installed, please refer to the installation instructions at the "Octave Installation" page http://www.ml-class.org/course/resources/index?page=octave-install on the course website.
At the Octave command line, typing help followed by a function name displays documentation for a built-in function. For example, help plot will bring up help information for plotting. Further documentation for Octave functions can be found at the Octave documentation pages.
We also strongly encourage using the online Q&A Forum to discuss exercises with other students. However, do not look at any source code written by others or share your source code with others.

1 Logistic Regression
In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university.
Suppose that you are the administrator of a university department and you want to determine each applicant's chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression.
For each training example, you have the applicant's scores on two exams and the admissions decision.
Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams. This outline and the framework code in ex2.m will guide you through the exercise.

1 Octave is a free alternative to MATLAB. For the programming exercises, you are free to use either Octave or MATLAB.

2

1.1 Visualizing the data
Before starting to implement any learning algorithm, it is always good to visualize the data if possible. In the first part of ex2.m, the code will load the data and display it on a 2-dimensional plot by calling the function plotData.
You will now complete the code in plotData so that it displays a figure like Figure 1, where the axes are the two exam scores, and the positive and negative examples are shown with different markers.

Figure 1: Scatter plot of training data

To help you get more familiar with plotting, we have left plotData.m empty so you can try to implement it yourself. However, this is an optional (ungraded) exercise. We also provide our implementation below so you can copy it or refer to it. If you choose to copy our example, make sure you learn what each of its commands is doing by consulting the Octave documentation.

% Find Indices of Positive and Negative Examples
pos = find(y==1); neg = find(y == 0);
% Plot Examples
plot(X(pos, 1), X(pos, 2), 'k+', 'LineWidth', 2, ...
     'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', ...
     'MarkerSize', 7);

3

1.2 Implementation
1.2.1 Warmup exercise: sigmoid function
Before you start with the actual cost function, recall that the logistic regression hypothesis is defined as:

h_\theta(x) = g(\theta^T x),

where function g is the sigmoid function. The sigmoid function is defined as:

g(z) = \frac{1}{1 + e^{-z}}.

Your first step is to implement this function in sigmoid.m so it can be called by the rest of your program. When you are finished, try testing a few values by calling sigmoid(x) at the Octave command line. For large positive values of x, the sigmoid should be close to 1, while for large negative values, the sigmoid should be close to 0. Evaluating sigmoid(0) should give you exactly 0.5. Your code should also work with vectors and matrices. For a matrix, your function should perform the sigmoid function on every element.
You can submit your solution for grading by typing submit at the Octave command line. The submission script will prompt you for your username and password and ask you which files you want to submit. You can obtain a submission password from the website.
You should now submit the warm up exercise.

1.2.2 Cost function and gradient
Now you will implement the cost function and gradient for logistic regression. Complete the code in costFunction.m to return the cost and gradient.
Recall that the cost function in logistic regression is

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right],

and the gradient of the cost is a vector of the same length as θ where the jth element (for j = 0, 1, . . . , n) is defined as follows:

4

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

Note that while this gradient looks identical to the linear regression gradient, the formula is actually different because linear and logistic regression have different definitions of hθ(x).
Once you are done, ex2.m will call your costFunction using the initial parameters of θ. You should see that the cost is about 0.693.
You should now submit the cost function and gradient for logistic regression. Make two submissions: one for the cost function and one for the gradient.

1.2.3 Learning parameters using fminunc
In the previous assignment, you found the optimal parameters of a linear regression model by implementing gradient descent. You wrote a cost function and calculated its gradient, then took a gradient descent step accordingly. This time, instead of taking gradient descent steps, you will use an Octave built-in function called fminunc.
Octave's fminunc is an optimization solver that finds the minimum of an unconstrained2 function. For logistic regression, you want to optimize the cost function J(θ) with parameters θ.
Concretely, you are going to use fminunc to find the best parameters θ for the logistic regression cost function, given a fixed dataset (of X and y values). You will pass to fminunc the following inputs:
• The initial values of the parameters we are trying to optimize.
• A function that, when given the training set and a particular θ, computes the logistic regression cost and gradient with respect to θ for the dataset (X, y)

In ex2.m, we already have code written to call fminunc with the correct arguments.

2 Constraints in optimization often refer to constraints on the parameters, for example, constraints that bound the possible values θ can take (e.g., θ ≤ 1). Logistic regression does not have such constraints since θ is allowed to take any real value.

5

% Set options for fminunc
options = optimset('GradObj', 'on', 'MaxIter', 400);
% Run fminunc to obtain the optimal theta
% This function will return theta and the cost
[theta, cost] = ...
    fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);

In this code snippet, we first defined the options to be used with fminunc. Specifically, we set the GradObj option to on, which tells fminunc that our function returns both the cost and the gradient. This allows fminunc to use the gradient when minimizing the function. Furthermore, we set the MaxIter option to 400, so that fminunc will run for at most 400 steps before it terminates.
To specify the actual function we are minimizing, we use a "short-hand" for specifying functions with the @(t) ( costFunction(t, X, y) ). This creates a function, with argument t, which calls your costFunction. This allows us to wrap the costFunction for use with fminunc.
If you have completed the costFunction correctly, fminunc will converge on the right optimization parameters and return the final values of the cost and θ. Notice that by using fminunc, you did not have to write any loops yourself, or set a learning rate like you did for gradient descent. This is all done by fminunc: you only needed to provide a function calculating the cost and the gradient.
Once fminunc completes, ex2.m will call your costFunction function using the optimal parameters of θ. You should see that the cost is about 0.203.
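As a reference while debugging, a vectorized costFunction.m in the spirit of the formulas above might look like the following sketch. It assumes your sigmoid.m is already implemented (e.g., g = 1 ./ (1 + exp(-z));) and is an illustration, not the official solution.

function [J, grad] = costFunction(theta, X, y)
  % Vectorized logistic regression cost and gradient.
  m = length(y);                 % number of training examples
  h = sigmoid(X * theta);        % hypothesis for all m examples at once
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));
  grad = (1/m) * (X' * (h - y)); % gradient, one entry per parameter
end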
This final θ value will then be used to plot the decision boundary on the training data, resulting in a figure similar to Figure 2. We also encourage you to look at the code in plotDecisionBoundary.m to see how to plot such a boundary using the θ values.

1.2.4 Evaluating logistic regression
After learning the parameters, you can use the model to predict whether a particular student will be admitted. For a student with an Exam 1 score of 45 and an Exam 2 score of 85, you should expect to see an admission probability of 0.774.
Another way to evaluate the quality of the parameters we have found is to see how well the learned model predicts on our training set. In this part, your task is to complete the code in predict.m. The predict function

6

Figure 2: Training data with decision boundary

will produce "1" or "0" predictions given a dataset and a learned parameter vector θ.
After you have completed the code in predict.m, the ex2.m script will proceed to report the training accuracy of your classifier by computing the percentage of examples it got correct.
You should now submit the prediction function for logistic regression.

2 Regularized logistic regression
In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.
Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.

7

You will use another script, ex2_reg.m to complete this portion of the exercise.

2.1 Visualizing the data
Similar to the previous parts of this exercise, plotData is used to generate a figure like Figure 3, where the axes are the two test scores, and the positive (y = 1, accepted) and negative (y = 0, rejected) examples are shown with different markers.

Figure 3: Plot of training data

Figure 3 shows that our dataset cannot be separated into positive and negative examples by a straight line through the plot. Therefore, a straightforward application of logistic regression will not perform well on this dataset since logistic regression will only be able to find a linear decision boundary.

2.2 Feature mapping
One way to fit the data better is to create more features from each data point. In the provided function mapFeature.m, we will map the features into all polynomial terms of x1 and x2 up to the sixth power.

8

\text{mapFeature}(x) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \\ x_1^3 \\ \vdots \\ x_1 x_2^5 \\ x_2^6 \end{bmatrix}

As a result of this mapping, our vector of two features (the scores on two QA tests) has been transformed into a 28-dimensional vector.
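A common way to implement such a mapping is a double loop over polynomial degrees; the sketch below illustrates the general pattern (the provided mapFeature.m may differ in its details):

function out = mapFeature(X1, X2)
  % Maps two input features to all polynomial terms of X1 and X2
  % up to the 6th power, with a leading column of ones (28 columns total).
  degree = 6;
  out = ones(size(X1(:,1)));
  for i = 1:degree
    for j = 0:i
      out(:, end+1) = (X1 .^ (i-j)) .* (X2 .^ j);
    end
  end
end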
A logistic regression classifier trained on this higher-dimension feature vector will have a more complex decision boundary and will appear nonlinear when drawn in our 2-dimensional plot.
While the feature mapping allows us to build a more expressive classifier, it is also more susceptible to overfitting. In the next parts of the exercise, you will implement regularized logistic regression to fit the data and also see for yourself how regularization can help combat the overfitting problem.

2.3 Cost function and gradient
Now you will implement code to compute the cost function and gradient for regularized logistic regression. Complete the code in costFunctionReg.m to return the cost and gradient.
Recall that the regularized cost function in logistic regression is

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2.

Note that you should not regularize the parameter θ0; thus, the final summation above is for j = 1 to n, not j = 0 to n. The gradient of the cost function is a vector where the jth element is defined as follows:

\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{for } j = 0

9

\frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \qquad \text{for } j \geq 1

Once you are done, ex2_reg.m will call your costFunctionReg function using the initial value of θ (initialized to all zeros). You should see that the cost is about 0.693.
You should now submit the cost function and gradient for regularized logistic regression. Make two submissions, one for the cost function and one for the gradient.

2.3.1 Learning parameters using fminunc
Similar to the previous parts, you will use fminunc to learn the optimal parameters θ. If you have completed the cost and gradient for regularized logistic regression (costFunctionReg.m) correctly, you should be able to step through the next part of ex2_reg.m to learn the parameters θ using fminunc.

2.4 Plotting the decision boundary
To help you visualize the model learned by this classifier, we have provided the function plotDecisionBoundary.m which plots the (non-linear) decision boundary that separates the positive and negative examples. In plotDecisionBoundary.m, we plot the non-linear decision boundary by computing the classifier's predictions on an evenly spaced grid and then drawing a contour plot of where the predictions change from y = 0 to y = 1.
After learning the parameters θ, the next step in ex2_reg.m will plot a decision boundary similar to Figure 4.

10

2.5 Optional (ungraded) exercises
In this part of the exercise, you will get to try out different regularization parameters for the dataset to understand how regularization prevents overfitting.
Notice the changes in the decision boundary as you vary λ. With a small λ, you should find that the classifier gets almost every training example correct, but draws a very complicated boundary, thus overfitting the data (Figure 5). This is not a good decision boundary: for example, it predicts that a point at x = (−0.25, 1.5) is accepted (y = 1), which seems to be an incorrect decision given the training set.
With a larger λ, you should see a plot that shows a simpler decision boundary which still separates the positives and negatives fairly well. However, if λ is set to too high a value, you will not get a good fit and the decision boundary will not follow the data so well, thus underfitting the data (Figure 6).
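Before moving on, here is how the formulas of Section 2.3 might translate into a vectorized costFunctionReg.m, building on the earlier unregularized sketch (again an illustration, not the official solution):

function [J, grad] = costFunctionReg(theta, X, y, lambda)
  % Regularized logistic regression cost and gradient.
  % theta(1), i.e. theta_0, is deliberately left out of the regularization.
  m = length(y);
  h = sigmoid(X * theta);
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda / (2*m)) * sum(theta(2:end) .^ 2);
  grad = (1/m) * (X' * (h - y));
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);
end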
You do not need to submit any solutions for these optional (ungraded) exercises.

Figure 4: Training data with decision boundary (λ = 1)

11

Figure 5: No regularization (Overfitting) (λ = 0)

Figure 6: Too much regularization (Underfitting) (λ = 100)

12

Submission and Grading
After completing various parts of the assignment, be sure to use the submit function to submit your solutions to our servers. The following is a breakdown of how each part of this exercise is scored.

Part                                    Submitted File       Points
Sigmoid Function                        sigmoid.m            5 points
Compute cost for logistic regression    costFunction.m       30 points
Gradient for logistic regression        costFunction.m       30 points
Predict Function                        predict.m            5 points
Compute cost for regularized LR         costFunctionReg.m    15 points
Gradient for regularized LR             costFunctionReg.m    15 points
Total Points                                                 100 points

You are allowed to submit your solutions multiple times, and we will take the highest score into consideration. To prevent rapid-fire guessing, the system enforces a minimum of 5 minutes between submissions.
All parts of this programming exercise are due Sunday, October 30th at 23:59:59 PDT.

13

\ No newline at end of file diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/costFunctionReg.m b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/costFunctionReg.m similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/costFunctionReg.m rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/costFunctionReg.m diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/credit.m b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/credit.m similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/credit.m rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/credit.m diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/cs-test.csv b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/cs-test.csv similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/cs-test.csv rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/cs-test.csv diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/cs-training.csv b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/cs-training.csv similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/cs-training.csv rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/cs-training.csv diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/featureNormalize.m b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/featureNormalize.m similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/featureNormalize.m rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/featureNormalize.m diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/p.csv b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/p.csv similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/p.csv rename to
LogisticRegression/mlclass-ex2/GiveMeSomeCredit/p.csv diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/predict.m b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/predict.m similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/predict.m rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/predict.m diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/result.csv b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/result.csv similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/result.csv rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/result.csv diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/result1.csv b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/result1.csv similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/result1.csv rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/result1.csv diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/sampleEntry.csv b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/sampleEntry.csv similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/sampleEntry.csv rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/sampleEntry.csv diff --git a/Logistic Regression/mlclass-ex2/GiveMeSomeCredit/sigmoid.m b/LogisticRegression/mlclass-ex2/GiveMeSomeCredit/sigmoid.m similarity index 100% rename from Logistic Regression/mlclass-ex2/GiveMeSomeCredit/sigmoid.m rename to LogisticRegression/mlclass-ex2/GiveMeSomeCredit/sigmoid.m diff --git a/Logistic Regression/mlclass-ex2/costFunction.m b/LogisticRegression/mlclass-ex2/costFunction.m similarity index 100% rename from Logistic Regression/mlclass-ex2/costFunction.m rename to LogisticRegression/mlclass-ex2/costFunction.m diff --git a/Logistic Regression/mlclass-ex2/costFunctionReg.m b/LogisticRegression/mlclass-ex2/costFunctionReg.m similarity index 100% rename from Logistic Regression/mlclass-ex2/costFunctionReg.m rename to LogisticRegression/mlclass-ex2/costFunctionReg.m diff --git a/Logistic Regression/mlclass-ex2/ex2.m b/LogisticRegression/mlclass-ex2/ex2.m similarity index 100% rename from Logistic Regression/mlclass-ex2/ex2.m rename to LogisticRegression/mlclass-ex2/ex2.m diff --git a/Logistic Regression/mlclass-ex2/ex2_reg.m b/LogisticRegression/mlclass-ex2/ex2_reg.m similarity index 100% rename from Logistic Regression/mlclass-ex2/ex2_reg.m rename to LogisticRegression/mlclass-ex2/ex2_reg.m diff --git a/Logistic Regression/mlclass-ex2/ex2data1.txt b/LogisticRegression/mlclass-ex2/ex2data1.txt similarity index 100% rename from Logistic Regression/mlclass-ex2/ex2data1.txt rename to LogisticRegression/mlclass-ex2/ex2data1.txt diff --git a/Logistic Regression/mlclass-ex2/ex2data2.txt b/LogisticRegression/mlclass-ex2/ex2data2.txt similarity index 100% rename from Logistic Regression/mlclass-ex2/ex2data2.txt rename to LogisticRegression/mlclass-ex2/ex2data2.txt diff --git a/Logistic Regression/mlclass-ex2/mapFeature.m b/LogisticRegression/mlclass-ex2/mapFeature.m similarity index 100% rename from Logistic Regression/mlclass-ex2/mapFeature.m rename to LogisticRegression/mlclass-ex2/mapFeature.m diff --git a/Logistic Regression/mlclass-ex2/plotData.m b/LogisticRegression/mlclass-ex2/plotData.m similarity index 100% rename from Logistic Regression/mlclass-ex2/plotData.m rename to LogisticRegression/mlclass-ex2/plotData.m diff --git a/Logistic Regression/mlclass-ex2/plotDecisionBoundary.m b/LogisticRegression/mlclass-ex2/plotDecisionBoundary.m 
similarity index 100% rename from Logistic Regression/mlclass-ex2/plotDecisionBoundary.m rename to LogisticRegression/mlclass-ex2/plotDecisionBoundary.m diff --git a/Logistic Regression/mlclass-ex2/predict.m b/LogisticRegression/mlclass-ex2/predict.m similarity index 100% rename from Logistic Regression/mlclass-ex2/predict.m rename to LogisticRegression/mlclass-ex2/predict.m diff --git a/Logistic Regression/mlclass-ex2/sigmoid.m b/LogisticRegression/mlclass-ex2/sigmoid.m similarity index 100% rename from Logistic Regression/mlclass-ex2/sigmoid.m rename to LogisticRegression/mlclass-ex2/sigmoid.m diff --git a/Logistic Regression/mlclass-ex2/submit.m b/LogisticRegression/mlclass-ex2/submit.m similarity index 100% rename from Logistic Regression/mlclass-ex2/submit.m rename to LogisticRegression/mlclass-ex2/submit.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/octave-core b/Multi-class classification and neural networks/mlclass-ex3/octave-core deleted file mode 100644 index 26394f7..0000000 Binary files a/Multi-class classification and neural networks/mlclass-ex3/octave-core and /dev/null differ diff --git a/Multi-class classification and neural networks/ex3.pdf b/MultiClassclassificationandNeuralNetworks/ex3.pdf similarity index 100% rename from Multi-class classification and neural networks/ex3.pdf rename to MultiClassclassificationandNeuralNetworks/ex3.pdf diff --git a/MultiClassclassificationandNeuralNetworks/ex3.txt b/MultiClassclassificationandNeuralNetworks/ex3.txt new file mode 100644 index 0000000..adb7472 --- /dev/null +++ b/MultiClassclassificationandNeuralNetworks/ex3.txt @@ -0,0 +1,631 @@ +Programming Exercise 3: +Multi-class Classification and Neural Networks +Machine Learning +October 25, 2011 + +Introduction +In this exercise, you will implement one-vs-all logistic regression and neural +networks to recognize hand-written digits. Before starting the programming +exercise, we strongly recommend watching the video lectures and completing +the review questions for the associated topics. +To get started with the exercise, download the starter code and unzip its +contents to the directory where you wish to complete the exercise. If needed, +use the cd command in Octave to change to this directory before starting +this exercise. + +Files included in this exercise +ex3.m - Octave script that will help step you through part 1 +ex3 nn.m - Octave script that will help step you through part 2 +ex3data1.mat - Training set of hand-written digits +ex3weights.mat - Initial weights for the neural network exercise +submitWeb.m - Alternative submission script +submit.m - Submission script that sends your solutions to our servers +displayData.m - Function to help visualize the dataset +fmincg.m - Function minimization routine (similar to fminunc) +sigmoid.m - Sigmoid function +[ ] lrCostFunction.m - Logistic regression cost function +[ ] oneVsAll.m - Train a one-vs-all multi-class classifier +[ ] predictOneVsAll.m - Predict using a one-vs-all multi-class classifier +[ ] predict.m - Neural network prediction function + +1 + + indicates files you will need to complete +Throughout the exercise, you will be using the scripts ex3.m and ex3 nn.m. +These scripts set up the dataset for the problems and make calls to functions +that you will write. You do not need to modify these scripts. You are only +required to modify functions in other files, by following the instructions in +this assignment. 
Where to get help
We also strongly encourage using the online Q&A Forum to discuss exercises with other students. However, do not look at any source code written by others or share your source code with others.
If you run into network errors using the submit script, you can also use an online form for submitting your solutions. To use this alternative submission interface, run the submitWeb script to generate a submission file (e.g., submit ex2 part1.txt). You can then submit this file through the web submission form in the programming exercises page (go to the programming exercises page, then select the exercise you are submitting for). If you are having no problems submitting through the standard submission system using the submit script, you do not need to use this alternative submission interface.

1 Multi-class Classification
For this exercise, you will use logistic regression and neural networks to recognize handwritten digits (from 0 to 9). Automated handwritten digit recognition is widely used today - from recognizing zip codes (postal codes) on mail envelopes to recognizing amounts written on bank checks. This exercise will show you how the methods you've learned can be used for this classification task.
In the first part of the exercise, you will extend your previous implementation of logistic regression and apply it to one-vs-all classification.

2

1.1 Dataset
You are given a data set in ex3data1.mat that contains 5000 training examples of handwritten digits.1 The .mat format means that the data has been saved in a native Octave/Matlab matrix format, instead of a text (ASCII) format like a csv-file. These matrices can be read directly into your program by using the load command. After loading, matrices of the correct dimensions and values will appear in your program's memory. The matrices will already be named, so you do not need to assign names to them.

% Load saved matrices from file
load('ex3data1.mat');
% The matrices X and y will now be in your Octave environment

There are 5000 training examples in ex3data1.mat, where each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The 20 by 20 grid of pixels is "unrolled" into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.

X = \begin{bmatrix} - (x^{(1)})^T - \\ - (x^{(2)})^T - \\ \vdots \\ - (x^{(m)})^T - \end{bmatrix}

The second part of the training set is a 5000-dimensional vector y that contains labels for the training set. To make things more compatible with Octave/Matlab indexing, where there is no zero index, we have mapped the digit zero to the value ten. Therefore, a "0" digit is labeled as "10", while the digits "1" to "9" are labeled as "1" to "9" in their natural order.

1.2 Visualizing the data
You will begin by visualizing a subset of the training set. In Part 1 of ex3.m, the code randomly selects 100 rows from X and passes those rows to the displayData function. This function maps each row to a 20 pixel by 20 pixel grayscale image and displays the images together. We have provided

1 This is a subset of the MNIST handwritten digit dataset (http://yann.lecun.com/exdb/mnist/).
+ +3 + + the displayData function, and you are encouraged to examine the code to +see how it works. After you run this step, you should see an image like Figure +1. + +Figure 1: Examples from the dataset + +1.3 + +Vectorizing Logistic Regression + +You will be using multiple one-vs-all logistic regression models to build a +multi-class classifier. Since there are 10 classes, you will need to train 10 +separate logistic regression classifiers. To make this training efficient, it is +important to ensure that your code is well vectorized. In this section, you +will implement a vectorized version of logistic regression that does not employ +any for loops. You can use your code in the last exercise as a starting point +for this exercise. +1.3.1 + +Vectorizing the cost function + +We will begin by writing a vectorized version of the cost function. Recall +that in (unregularized) logistic regression, the cost function is +1 +J(θ) = +m + +m + +−y (i) log(hθ (x(i) )) − (1 − y (i) ) log(1 − hθ (x(i) )) . +i=1 + +To compute each element in the summation, we have to compute hθ (x(i) ) +for every example i, where hθ (x(i) ) = g(θT x(i) ) and g(z) = 1+e1−z is the +4 + + sigmoid function. It turns out that we can compute this quickly for all our +examples by using matrix multiplication. Let us define X and θ as + + + + +— (x(1) )T — +θ0 + — (x(2) )T —  + θ1  + + + + +X= +and +θ += + + ..  . +.. + + + +.  +. +(m) T +θn +— (x ) — +Then, by computing the matrix product Xθ, we have + +  +— (x(1) )T θ — +— θT (x(1) ) — + — (x(2) )T θ —   — θT (x(2) ) — + +  +Xθ =  += +.. +.. + +  +. +. +(m) T +T +— (x ) θ — +— θ (x(m) ) — + + + + +. + + +In the last equality, we used the fact that aT b = bT a if a and b are vectors. +This allows us to compute the products θT x(i) for all our examples i in one +line of code. +Your job is to write the unregularized cost function in the file lrCostFunction.m +Your implementation should use the strategy we presented above to calculate θT x(i) . You should also use a vectorized approach for the rest of the +cost function. A fully vectorized version of lrCostFunction.m should not +contain any loops. +(Hint: You might want to use the element-wise multiplication operation +(.*) and the sum operation sum when writing this function) +1.3.2 + +Vectorizing the gradient + +Recall that the gradient of the (unregularized) logistic regression cost is a +vector where the j th element is defined as +1 +∂J += +∂θj +m + +m +(i) + +(hθ (x(i) ) − y (i) )xj + +. + +i=1 + +To vectorize this operation over the dataset, we start by writing out all + +5 + + the partial derivatives explicitly for all θj , + +m +(i) +(i) (i) + ∂J  +i=1 (hθ (x ) − y )x0 + +∂θ0 + + ∂J  +(i) +m + + ∂θ1  +(hθ (x(i) ) − y (i) )x1 + +i=1 + + + + ∂J  + ∂θ2  = 1  +(i) +m +(hθ (x(i) ) − y (i) )x2 + + m +i=1 + + .  + + ..  + +.. + + + +. + +∂J +m +(i) +(i) (i) +∂θn +i=1 (hθ (x ) − y )xn += + +1 +m + + + + + + + + + + + + + + +m + +(hθ (x(i) ) − y (i) )x(i) +i=1 + +1 += X T (hθ (x) − y). +m +where + + + + + + +hθ (x) − y =  + + +hθ (x(1) ) − y (1) +hθ (x(2) ) − y (2) +.. +. +hθ (x(1) ) − y (m) + +(1) + + + +. + + +Note that x(i) is a vector, while (hθ (x(i) )−y (i) ) is a scalar (single number). +To understand the last step of the derivation, let βi = (hθ (x(i) ) − y (i) ) and +observe that: + + + β1 + +| +| +| + β2  + +(i) +(1) +(2) +(m)   + +βi x = x +x +... x + ..  = X T β, + . 
 +i +| +| +| +βm +where the values βi = (hθ (x(i) ) − y (i) ). +The expression above allows us to compute all the partial derivatives +without any loops. If you are comfortable with linear algebra, we encourage +you to work through the matrix multiplications above to convince yourself +that the vectorized version does the same computations. You should now +implement Equation 1 to compute the correct vectorized gradient. Once you +are done, complete the function lrCostFunction.m by implementing the +gradient. + +6 + + Debugging Tip: Vectorizing code can sometimes be tricky. One common strategy for debugging is to print out the sizes of the matrices you +are working with using the size function. For example, given a data matrix X of size 100 × 20 (100 examples, 20 features) and θ, a vector with +dimensions 20 × 1, you can observe that Xθ is a valid multiplication operation, while θX is not. Furthermore, if you have a non-vectorized version +of your code, you can compare the output of your vectorized code and +non-vectorized code to make sure that they produce the same outputs. +1.3.3 + +Vectorizing regularized logistic regression + +After you have implemented vectorization for logistic regression, you will now +add regularization to the cost function. Recall that for regularized logistic +regression, the cost function is defined as +1 +J(θ) = +m + +m + +−y + +(i) + +i=1 + +λ +log(hθ (x )) − (1 − y ) log(1 − hθ (x )) + +2m +(i) + +(i) + +n + +(i) + +θj2 . +j=1 + +Note that you should not be regularizing θ0 which is used for the bias +term. +Correspondingly, the partial derivative of regularized logistic regression +cost for θj is defined as +1 +∂J += +∂θ0 +m +∂J +1 += +∂θj +m + +m +(i) + +(hθ (x(i) ) − y (i) )xj ) + +for j = 0 + +i=1 +m +(i) + +(hθ (x(i) ) − y (i) )xj + λθj + +for j ≥ 1. + +i=1 + +Now modify your code in lrCostFunction to account for regularization. +Once again, you should not put any loops into your code. + +7 + + Octave Tip: When implementing the vectorization for regularized logistic regression, you might often want to only sum and update certain +elements of θ. In Octave, you can index into the matrices to access and +update only certain elements. For example, A(:, 3:5) = B(:, 1:3) will +replaces the columns 3 to 5 of A with the columns 1 to 3 from B. One +special keyword you can use in indexing is the end keyword in indexing. +This allows us to select columns (or rows) until the end of the matrix. +For example, A(:, 2:end) will only return elements from the 2nd to last +column of A. Thus, you could use this together with the sum and .^ operations to compute the sum of only the elements you are interested in +(e.g., sum(z(2:end).^2)). In the starter code, lrCostFunction.m, we +have also provided hints on yet another possible method computing the +regularized gradient. +You should now submit your vectorized logistic regression cost function. + +1.4 + +One-vs-all Classification + +In this part of the exercise, you will implement one-vs-all classification by +training multiple regularized logistic regression classifiers, one for each of +the K classes in our dataset (Figure 1). In the handwritten digits dataset, +K = 10, but your code should work for any value of K. +You should now complete the code in oneVsAll.m to train one classifier for +each class. In particular, your code should return all the classifier parameters +in a matrix Θ ∈ RK×(N +1) , where each row of Θ corresponds to the learned +logistic regression parameters for one class. 
You should now submit your vectorized logistic regression cost function.

1.4 One-vs-all Classification

In this part of the exercise, you will implement one-vs-all classification by training multiple regularized logistic regression classifiers, one for each of the K classes in our dataset (Figure 1). In the handwritten digits dataset, K = 10, but your code should work for any value of K.

You should now complete the code in oneVsAll.m to train one classifier for each class. In particular, your code should return all the classifier parameters in a matrix Θ ∈ R^(K×(N+1)), where each row of Θ corresponds to the learned logistic regression parameters for one class. You can do this with a "for"-loop from 1 to K, training each classifier independently.

Note that the y argument to this function is a vector of labels from 1 to 10, where we have mapped the digit "0" to the label 10 (to avoid confusions with indexing).

When training the classifier for class k ∈ {1, ..., K}, you will want an m-dimensional vector of labels y, where y_j ∈ {0, 1} indicates whether the j-th training instance belongs to class k (y_j = 1), or if it belongs to a different class (y_j = 0). You may find logical arrays helpful for this task.

Octave Tip: Logical arrays in Octave are arrays which contain binary (0 or 1) elements. In Octave, evaluating the expression a == b for a vector a (of size m × 1) and scalar b will return a vector of the same size as a with ones at positions where the elements of a are equal to b and zeroes where they are different. To see how this works for yourself, try the following code in Octave:

a = 1:10; % Create a and b
b = 3;
a == b    % You should try different values of b here

Furthermore, you will be using fmincg for this exercise (instead of fminunc). fmincg works similarly to fminunc, but is more efficient for dealing with a large number of parameters.

After you have correctly completed the code for oneVsAll.m, the script ex3.m will continue to use your oneVsAll function to train a multi-class classifier.

You should now submit the training function for one-vs-all classification.

1.4.1 One-vs-all Prediction

After training your one-vs-all classifier, you can now use it to predict the digit contained in a given image. For each input, you should compute the "probability" that it belongs to each class using the trained logistic regression classifiers. Your one-vs-all prediction function will pick the class for which the corresponding logistic regression classifier outputs the highest probability and return the class label (1, 2, ..., or K) as the prediction for the input example.

You should now complete the code in predictOneVsAll.m to use the one-vs-all classifier to make predictions. Once you are done, ex3.m will call your predictOneVsAll function using the learned value of Θ. You should see that the training set accuracy is about 94.9% (i.e., it classifies 94.9% of the examples in the training set correctly).
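The core of such a prediction function can be a single row-wise max. The sketch below assumes all_theta is the K × (N+1) matrix returned by oneVsAll and that the bias column has already been added to X (check the starter code's conventions before reusing this):

% Sketch: one-vs-all prediction via a row-wise max
probs = sigmoid(X * all_theta');      % m x K matrix of class "probabilities"
[max_probs, p] = max(probs, [], 2);   % p(i) is the index (label) of the best class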
You should now submit the prediction function for one-vs-all classification.

2 Neural Networks

In the previous part of this exercise, you implemented multi-class logistic regression to recognize handwritten digits. However, logistic regression cannot form more complex hypotheses as it is only a linear classifier.[2]

In this part of the exercise, you will implement a neural network to recognize handwritten digits using the same training set as before. The neural network will be able to represent complex models that form non-linear hypotheses. For this week, you will be using parameters from a neural network that we have already trained. Your goal is to implement the feedforward propagation algorithm to use our weights for prediction. In next week's exercise, you will write the backpropagation algorithm for learning the neural network parameters.

The provided script, ex3_nn.m, will help you step through this exercise.

[2] You could add more features (such as polynomial features) to logistic regression, but that can be very expensive to train.

2.1 Model representation

Our neural network is shown in Figure 2. It has 3 layers - an input layer, a hidden layer and an output layer. Recall that our inputs are pixel values of digit images. Since the images are of size 20×20, this gives us 400 input layer units (excluding the extra bias unit which always outputs +1). As before, the training data will be loaded into the variables X and y.

You have been provided with a set of network parameters (Θ^(1), Θ^(2)) already trained by us. These are stored in ex3weights.mat and will be loaded by ex3_nn.m into Theta1 and Theta2. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).

% Load saved matrices from file
load('ex3weights.mat');
% The matrices Theta1 and Theta2 will now be in your Octave
% environment
% Theta1 has size 25 x 401
% Theta2 has size 10 x 26

Figure 2: Neural network model.

2.2 Feedforward Propagation and Prediction

Now you will implement feedforward propagation for the neural network. You will need to complete the code in predict.m to return the neural network's prediction.

You should implement the feedforward computation that computes $h_\theta(x^{(i)})$ for every example i and returns the associated predictions. Similar to the one-vs-all classification strategy, the prediction from the neural network will be the label that has the largest output $(h_\theta(x))_k$.

Implementation Note: The matrix X contains the examples in rows. When you complete the code in predict.m, you will need to add the column of 1's to the matrix. The matrices Theta1 and Theta2 contain the parameters for each unit in rows. Specifically, the first row of Theta1 corresponds to the first hidden unit in the second layer. In Octave, when you compute $z^{(2)} = \Theta^{(1)} a^{(1)}$, be sure that you index (and if necessary, transpose) X correctly so that you get $a^{(l)}$ as a column vector.

Once you are done, ex3_nn.m will call your predict function using the loaded set of parameters for Theta1 and Theta2. You should see that the accuracy is about 97.5%. After that, an interactive sequence will launch displaying images from the training set one at a time, while the console prints out the predicted label for the displayed image. To stop the image sequence, press Ctrl-C.
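Done in one vectorized pass over all m examples, the feedforward computation can look like this sketch (layer sizes as given above; variable names are illustrative):

% Sketch: vectorized feedforward for the 3-layer network
a1 = [ones(m, 1) X];              % add bias column: m x 401
a2 = sigmoid(a1 * Theta1');       % hidden activations: m x 25
a2 = [ones(m, 1) a2];             % add bias: m x 26
h  = sigmoid(a2 * Theta2');       % output activations: m x 10
[max_vals, p] = max(h, [], 2);    % predicted label = index of largest output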
You should now submit the neural network prediction function.

Submission and Grading

After completing this assignment, be sure to use the submit function to submit your solutions to our servers. The following is a breakdown of how each part of this exercise is scored.

Part                                  Submitted File       Points
Regularized Logistic Regression       lrCostFunction.m     30 points
One-vs-all classifier training        oneVsAll.m           20 points
One-vs-all classifier prediction      predictOneVsAll.m    20 points
Neural Network Prediction Function    predict.m            30 points
Total Points                                               100 points

You are allowed to submit your solutions multiple times, and we will take the highest score into consideration. To prevent rapid-fire guessing, the system enforces a minimum of 5 minutes between submissions.

All parts of this programming exercise are due Sunday, November 6th at 23:59:59 PDT.

\ No newline at end of file diff --git a/Multi-class classification and neural networks/mlclass-ex3/displayData.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/displayData.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/displayData.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/displayData.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/ex3.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/ex3.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/ex3_nn.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3_nn.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/ex3_nn.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3_nn.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/ex3data1.mat b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3data1.mat similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/ex3data1.mat rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3data1.mat diff --git a/Multi-class classification and neural networks/mlclass-ex3/ex3weights.mat b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3weights.mat similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/ex3weights.mat rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/ex3weights.mat diff --git a/Multi-class classification and neural networks/mlclass-ex3/fmincg.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/fmincg.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/fmincg.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/fmincg.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/lrCostFunction.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/lrCostFunction.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/lrCostFunction.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/lrCostFunction.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/oneVsAll.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/oneVsAll.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/oneVsAll.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/oneVsAll.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/predict.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/predict.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/predict.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/predict.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/predictOneVsAll.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/predictOneVsAll.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/predictOneVsAll.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/predictOneVsAll.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/sigmoid.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/sigmoid.m similarity index 100% rename
from Multi-class classification and neural networks/mlclass-ex3/sigmoid.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/sigmoid.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/submit.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/submit.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/submit.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/submit.m diff --git a/Multi-class classification and neural networks/mlclass-ex3/submitWeb.m b/MultiClassclassificationandNeuralNetworks/mlclass-ex3/submitWeb.m similarity index 100% rename from Multi-class classification and neural networks/mlclass-ex3/submitWeb.m rename to MultiClassclassificationandNeuralNetworks/mlclass-ex3/submitWeb.m diff --git a/Neural network learning/mlclass-ex4/octave-core b/Neural network learning/mlclass-ex4/octave-core deleted file mode 100644 index c1dc681..0000000 Binary files a/Neural network learning/mlclass-ex4/octave-core and /dev/null differ diff --git a/Neural network learning/ex4.pdf b/Neural_network_learning/ex4.pdf similarity index 100% rename from Neural network learning/ex4.pdf rename to Neural_network_learning/ex4.pdf diff --git a/Neural_network_learning/ex4.txt b/Neural_network_learning/ex4.txt new file mode 100644 index 0000000..19abbe3 --- /dev/null +++ b/Neural_network_learning/ex4.txt @@ -0,0 +1,653 @@

Programming Exercise 4:
Neural Networks Learning
Machine Learning
November 4, 2011

Introduction

In this exercise, you will implement the backpropagation algorithm for neural networks and apply it to the task of hand-written digit recognition. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics.

To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, use the cd command in Octave to change to this directory before starting this exercise.

Files included in this exercise

ex4.m - Octave script that will help step you through the exercise
ex4data1.mat - Training set of hand-written digits
ex4weights.mat - Neural network parameters for exercise 4
submit.m - Submission script that sends your solutions to our servers
submitWeb.m - Alternative submission script
displayData.m - Function to help visualize the dataset
fmincg.m - Function minimization routine (similar to fminunc)
sigmoid.m - Sigmoid function
computeNumericalGradient.m - Numerically compute gradients
checkNNGradients.m - Function to help check your gradients
debugInitializeWeights.m - Function for initializing weights
predict.m - Neural network prediction function
[ ] sigmoidGradient.m - Compute the gradient of the sigmoid function
[ ] randInitializeWeights.m - Randomly initialize weights
[ ] nnCostFunction.m - Neural network cost function

[ ] indicates files you will need to complete

Throughout the exercise, you will be using the script ex4.m. These scripts set up the dataset for the problems and make calls to functions that you will write. You do not need to modify the script. You are only required to modify functions in other files, by following the instructions in this assignment.

Where to get help

We also strongly encourage using the online Q&A Forum to discuss exercises with other students.
However, do not look at any source code written by others or share your source code with others.

If you run into network errors using the submit script, you can also use an online form for submitting your solutions. To use this alternative submission interface, run the submitWeb script to generate a submission file (e.g., submit ex4 part1.txt). You can then submit this file through the web submission form in the programming exercises page (go to the programming exercises page, then select the exercise you are submitting for). If you are having no problems submitting through the standard submission system using the submit script, you do not need to use this alternative submission interface.

1 Neural Networks

In the previous exercise, you implemented feedforward propagation for neural networks and used it to predict handwritten digits with the weights we provided. In this exercise, you will implement the backpropagation algorithm to learn the parameters for the neural network.

The provided script, ex4.m, will help you step through this exercise.

1.1 Visualizing the data

In the first part of ex4.m, the code will load the data and display it on a 2-dimensional plot (Figure 1) by calling the function displayData.

Figure 1: Examples from the dataset

This is the same dataset that you used in the previous exercise. There are 5000 training examples in ex4data1.mat, where each training example is a 20 pixel by 20 pixel grayscale image of the digit. Each pixel is represented by a floating point number indicating the grayscale intensity at that location. The 20 by 20 grid of pixels is "unrolled" into a 400-dimensional vector. Each of these training examples becomes a single row in our data matrix X. This gives us a 5000 by 400 matrix X where every row is a training example for a handwritten digit image.

$$X = \begin{bmatrix} - (x^{(1)})^T - \\ - (x^{(2)})^T - \\ \vdots \\ - (x^{(m)})^T - \end{bmatrix}$$

The second part of the training set is a 5000-dimensional vector y that contains labels for the training set. To make things more compatible with Octave/Matlab indexing, where there is no zero index, we have mapped the digit zero to the value ten. Therefore, a "0" digit is labeled as "10", while the digits "1" to "9" are labeled as "1" to "9" in their natural order.

1.2 Model representation

Our neural network is shown in Figure 2. It has 3 layers - an input layer, a hidden layer and an output layer. Recall that our inputs are pixel values of digit images. Since the images are of size 20 × 20, this gives us 400 input layer units (not counting the extra bias unit which always outputs +1). The training data will be loaded into the variables X and y by the ex4.m script.

You have been provided with a set of network parameters (Θ^(1), Θ^(2)) already trained by us. These are stored in ex4weights.mat and will be loaded by ex4.m into Theta1 and Theta2. The parameters have dimensions that are sized for a neural network with 25 units in the second layer and 10 output units (corresponding to the 10 digit classes).

% Load saved matrices from file
load('ex4weights.mat');
% The matrices Theta1 and Theta2 will now be in your workspace
% Theta1 has size 25 x 401
% Theta2 has size 10 x 26

Figure 2: Neural network model.

1.3 Feedforward and cost function

Now you will implement the cost function and gradient for the neural network. First, complete the code in nnCostFunction.m to return the cost.
Recall that the cost function for the neural network (without regularization) is

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log((h_\theta(x^{(i)}))_k) - (1 - y_k^{(i)}) \log(1 - (h_\theta(x^{(i)}))_k) \right],$$

where $h_\theta(x^{(i)})$ is computed as shown in Figure 2 and K = 10 is the total number of possible labels. Note that $h_\theta(x^{(i)})_k = a_k^{(3)}$ is the activation (output value) of the k-th output unit. Also, recall that whereas the original labels (in the variable y) were 1, 2, ..., 10, for the purpose of training a neural network, we need to recode the labels as vectors containing only values 0 or 1, so that

$$y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \; \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \; \ldots \; \text{or} \; \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}.$$

For example, if $x^{(i)}$ is an image of the digit 5, then the corresponding $y^{(i)}$ (that you should use with the cost function) should be a 10-dimensional vector with $y_5 = 1$, and the other elements equal to 0.

You should implement the feedforward computation that computes $h_\theta(x^{(i)})$ for every example i and sum the cost over all examples. Your code should also work for a dataset of any size, with any number of labels (you can assume that there are always at least K ≥ 3 labels).

Implementation Note: The matrix X contains the examples in rows (i.e., X(i,:)' is the i-th training example $x^{(i)}$, expressed as an n × 1 vector). When you complete the code in nnCostFunction.m, you will need to add the column of 1's to the X matrix. The parameters for each unit in the neural network are represented in Theta1 and Theta2 as one row. Specifically, the first row of Theta1 corresponds to the first hidden unit in the second layer. You can use a for-loop over the examples to compute the cost.
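For the label recoding step, a compact Octave idiom is to index into an identity matrix; the sketch below is illustrative only (num_labels and H, the m × K matrix of output activations from your feedforward pass, are assumed names, not part of the starter code):

% Sketch: recode y into 0/1 vectors, then compute the unregularized cost
Y = eye(num_labels)(y, :);     % m x K; row i is the recoded vector for y(i)
J = (1/m) * sum(sum(-Y .* log(H) - (1 - Y) .* log(1 - H)));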
Once you are done, ex4.m will call your nnCostFunction using the loaded set of parameters for Theta1 and Theta2. You should see that the cost is about 0.287629.

You should now submit the neural network cost function (feedforward).

1.4 Regularized cost function

The cost function for neural networks with regularization is given by

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ -y_k^{(i)} \log((h_\theta(x^{(i)}))_k) - (1 - y_k^{(i)}) \log(1 - (h_\theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \left[ \sum_{j=1}^{25} \sum_{k=1}^{400} (\Theta_{j,k}^{(1)})^2 + \sum_{j=1}^{10} \sum_{k=1}^{25} (\Theta_{j,k}^{(2)})^2 \right].$$

You can assume that the neural network will only have 3 layers - an input layer, a hidden layer and an output layer. However, your code should work for any number of input units, hidden units and output units. While we have explicitly listed the indices above for Θ^(1) and Θ^(2) for clarity, do note that your code should in general work with Θ^(1) and Θ^(2) of any size.

Note that you should not be regularizing the terms that correspond to the bias. For the matrices Theta1 and Theta2, this corresponds to the first column of each matrix. You should now add regularization to your cost function. Notice that you can first compute the unregularized cost function J using your existing nnCostFunction.m and then later add the cost for the regularization terms.

Once you are done, ex4.m will call your nnCostFunction using the loaded set of parameters for Theta1 and Theta2, and λ = 1. You should see that the cost is about 0.383770.

You should now submit the regularized neural network cost function (feedforward).

2 Backpropagation

In this part of the exercise, you will implement the backpropagation algorithm to compute the gradient for the neural network cost function. You will need to complete the nnCostFunction.m so that it returns an appropriate value for grad. Once you have computed the gradient, you will be able to train the neural network by minimizing the cost function J(Θ) using an advanced optimizer such as fmincg.

You will first implement the backpropagation algorithm to compute the gradients for the parameters for the (unregularized) neural network. After you have verified that your gradient computation for the unregularized case is correct, you will implement the gradient for the regularized neural network.

2.1 Sigmoid gradient

To help you get started with this part of the exercise, you will first implement the sigmoid gradient function. The gradient for the sigmoid function can be computed as

$$g'(z) = \frac{d}{dz} g(z) = g(z)(1 - g(z))$$

where

$$\text{sigmoid}(z) = g(z) = \frac{1}{1 + e^{-z}}.$$

When you are done, try testing a few values by calling sigmoidGradient(z) at the Octave command line. For large values (both positive and negative) of z, the gradient should be close to 0. When z = 0, the gradient should be exactly 0.25. Your code should also work with vectors and matrices. For a matrix, your function should perform the sigmoid gradient function on every element.
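Because the formula above is element-wise, one direct way to write the function body (a short sketch, assuming z is the function's input argument) is:

% Sketch: element-wise sigmoid gradient, valid for scalars, vectors, matrices
s = 1.0 ./ (1.0 + exp(-z));   % g(z)
g = s .* (1 - s);             % g'(z); at z = 0 this is exactly 0.25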
You should now submit the sigmoid gradient function.

2.2 Random initialization

When training neural networks, it is important to randomly initialize the parameters for symmetry breaking. One effective strategy for random initialization is to randomly select values for Θ^(l) uniformly in the range [−ε_init, ε_init]. You should use ε_init = 0.12.[1] This range of values ensures that the parameters are kept small and makes the learning more efficient.

Your job is to complete randInitializeWeights.m to initialize the weights for Θ; modify the file and fill in the following code:

% Randomly initialize the weights to small values
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;

You do not need to submit any code for this part of the exercise.

[1] One effective strategy for choosing ε_init is to base it on the number of units in the network. A good choice is $\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$, where $L_{in} = s_l$ and $L_{out} = s_{l+1}$ are the number of units in the layers adjacent to Θ^(l).

2.3 Backpropagation

Figure 3: Backpropagation Updates.

Now, you will implement the backpropagation algorithm. Recall that the intuition behind the backpropagation algorithm is as follows. Given a training example $(x^{(t)}, y^{(t)})$, we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis $h_\Theta(x)$. Then, for each node j in layer l, we would like to compute an "error term" $\delta_j^{(l)}$ that measures how much that node was "responsible" for any errors in our output.

For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define $\delta_j^{(3)}$ (since layer 3 is the output layer). For the hidden units, you will compute $\delta_j^{(l)}$ based on a weighted average of the error terms of the nodes in layer (l + 1).

In detail, here is the backpropagation algorithm (also depicted in Figure 3). You should implement steps 1 to 4 in a loop that processes one example at a time. Concretely, you should implement a for-loop for t = 1:m and place steps 1-4 below inside the for-loop, with the t-th iteration performing the calculation on the t-th training example $(x^{(t)}, y^{(t)})$. Step 5 will divide the accumulated gradients by m to obtain the gradients for the neural network cost function.

1. Set the input layer's values $(a^{(1)})$ to the t-th training example $x^{(t)}$. Perform a feedforward pass (Figure 2), computing the activations $(z^{(2)}, a^{(2)}, z^{(3)}, a^{(3)})$ for layers 2 and 3. Note that you need to add a +1 term to ensure that the vectors of activations for layers $a^{(1)}$ and $a^{(2)}$ also include the bias unit. In Octave, if a_1 is a column vector, adding one corresponds to a_1 = [1; a_1].

2. For each output unit k in layer 3 (the output layer), set

$$\delta_k^{(3)} = (a_k^{(3)} - y_k),$$

where $y_k \in \{0, 1\}$ indicates whether the current training example belongs to class k ($y_k = 1$), or if it belongs to a different class ($y_k = 0$). You may find logical arrays helpful for this task (explained in the previous programming exercise).

3. For the hidden layer l = 2, set

$$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \; .\!\!* \; g'(z^{(2)})$$

4. Accumulate the gradient from this example using the following formula. Note that you should skip or remove $\delta_0^{(2)}$. In Octave, removing $\delta_0^{(2)}$ corresponds to delta_2 = delta_2(2:end).

$$\Delta^{(l)} = \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$$

5. Obtain the (unregularized) gradient for the neural network cost function by dividing the accumulated gradients by m:

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)}$$
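Putting steps 1-5 together, the loop might be shaped as follows (a sketch under the assumption that Y is an m × K matrix of recoded labels, as in the earlier cost sketch; names are illustrative):

% Sketch: backpropagation over the training set (steps 1-5)
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
  a1 = [1; X(t, :)'];                 % step 1: 401 x 1 input with bias
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];              % 26 x 1 hidden activations with bias
  a3 = sigmoid(Theta2 * a2);          % 10 x 1 output activations
  delta3 = a3 - Y(t, :)';             % step 2: output-layer error
  delta2 = (Theta2' * delta3)(2:end) .* sigmoidGradient(z2);  % steps 3-4: drop bias error
  Delta2 = Delta2 + delta3 * a2';     % step 4: accumulate gradients
  Delta1 = Delta1 + delta2 * a1';
end
Theta1_grad = Delta1 / m;             % step 5: unregularized gradients
Theta2_grad = Delta2 / m;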
Octave Tip: You should implement the backpropagation algorithm only after you have successfully completed the feedforward and cost functions. While implementing the backpropagation algorithm, it is often useful to use the size function to print out the sizes of the variables you are working with if you run into dimension mismatch errors ("nonconformant arguments" errors in Octave).

After you have implemented the backpropagation algorithm, the script ex4.m will proceed to run gradient checking on your implementation. The gradient check will allow you to increase your confidence that your code is computing the gradients correctly.

2.4 Gradient checking

In your neural network, you are minimizing the cost function J(Θ). To perform gradient checking on your parameters, you can imagine "unrolling" the parameters Θ^(1), Θ^(2) into a long vector θ. By doing so, you can think of the cost function being J(θ) instead and use the following gradient checking procedure.

Suppose you have a function $f_i(\theta)$ that purportedly computes $\frac{\partial}{\partial \theta_i} J(\theta)$; you'd like to check if $f_i$ is outputting correct derivative values. Let

$$\theta^{(i+)} = \theta + \begin{bmatrix} 0 \\ \vdots \\ \epsilon \\ \vdots \\ 0 \end{bmatrix} \quad \text{and} \quad \theta^{(i-)} = \theta - \begin{bmatrix} 0 \\ \vdots \\ \epsilon \\ \vdots \\ 0 \end{bmatrix}$$

So, $\theta^{(i+)}$ is the same as θ, except its i-th element has been incremented by ε. Similarly, $\theta^{(i-)}$ is the corresponding vector with the i-th element decreased by ε. You can now numerically verify $f_i(\theta)$'s correctness by checking, for each i, that:

$$f_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}.$$

The degree to which these two values should approximate each other will depend on the details of J. But assuming ε = 10^(-4), you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).

We have implemented the function to compute the numerical gradient for you in computeNumericalGradient.m. While you are not required to modify the file, we highly encourage you to take a look at the code to understand how it works.

In the next step of ex4.m, it will run the provided function checkNNGradients.m which will create a small neural network and dataset that will be used for checking your gradients. If your backpropagation implementation is correct, you should see a relative difference that is less than 1e-9.

Practical Tip: When performing gradient checking, it is much more efficient to use a small neural network with a relatively small number of input units and hidden units, thus having a relatively small number of parameters. Each dimension of θ requires two evaluations of the cost function and this can be expensive. In the function checkNNGradients, our code creates a small random model and dataset which is used with computeNumericalGradient for gradient checking. Furthermore, after you are confident that your gradient computations are correct, you should turn off gradient checking before running your learning algorithm.

Practical Tip: Gradient checking works for any function where you are computing the cost and the gradient. Concretely, you can use the same computeNumericalGradient.m function to check if your gradient implementations for the other exercises are correct too (e.g., logistic regression's cost function).

Once your cost function passes the gradient check for the (unregularized) neural network cost function, you should submit the neural network gradient function (backpropagation).

2.5 Regularized Neural Networks

After you have successfully implemented the backpropagation algorithm, you will add regularization to the gradient. To account for regularization, it turns out that you can add this as an additional term after computing the gradients using backpropagation.

Specifically, after you have computed $\Delta_{ij}^{(l)}$ using backpropagation, you should add regularization using

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} \qquad \text{for } j = 0$$

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)} = \frac{1}{m} \Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} \qquad \text{for } j \geq 1$$

Note that you should not be regularizing the first column of Θ^(l), which is used for the bias term. Furthermore, in the parameters $\Theta_{ij}^{(l)}$, i is indexed starting from 1, and j is indexed starting from 0. Thus,

$$\Theta^{(l)} = \begin{bmatrix} \Theta_{1,0}^{(l)} & \Theta_{1,1}^{(l)} & \ldots \\ \Theta_{2,0}^{(l)} & \Theta_{2,1}^{(l)} & \\ \vdots & & \ddots \end{bmatrix}$$

Somewhat confusingly, indexing in Octave starts from 1 (for both i and j), thus Theta1(2, 1) actually corresponds to $\Theta_{2,0}^{(l)}$ (i.e., the entry in the second row, first column of the matrix Θ^(1) shown above).
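In code, this regularization step can be as small as two lines; the sketch below assumes the Theta1_grad/Theta2_grad accumulators from the earlier backpropagation sketch:

% Sketch: regularize every column except the first (bias) column
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + (lambda/m) * Theta1(:, 2:end);
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + (lambda/m) * Theta2(:, 2:end);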
Now modify your code that computes grad in nnCostFunction to account for regularization. After you are done, the ex4.m script will proceed to run gradient checking on your implementation. If your code is correct, you should expect to see a relative difference that is less than 1e-9.

You should now submit your regularized neural network gradient.

2.6 Learning parameters using fmincg

After you have successfully implemented the neural network cost function and gradient computation, the next step of the ex4.m script will use fmincg to learn a good set of parameters.

After the training completes, the ex4.m script will proceed to report the training accuracy of your classifier by computing the percentage of examples it got correct. If your implementation is correct, you should see a reported training accuracy of about 95.3% (this may vary by about 1% due to the random initialization). It is possible to get higher training accuracies by training the neural network for more iterations. We encourage you to try training the neural network for more iterations (e.g., set MaxIter to 400) and also vary the regularization parameter λ. With the right learning settings, it is possible to get the neural network to perfectly fit the training set.

3 Visualizing the hidden layer

One way to understand what your neural network is learning is to visualize the representations captured by the hidden units. Informally, given a particular hidden unit, one way to visualize what it computes is to find an input x that will cause it to activate (that is, to have an activation value $(a_i^{(l)})$ close to 1). For the neural network you trained, notice that the i-th row of Θ^(1) is a 401-dimensional vector that represents the parameters for the i-th hidden unit. If we discard the bias term, we get a 400-dimensional vector that represents the weights from each input pixel to the hidden unit.

Thus, one way to visualize the "representation" captured by the hidden unit is to reshape this 400-dimensional vector into a 20 × 20 image and display it.[2] The next step of ex4.m does this by using the displayData function and it will show you an image (similar to Figure 4) with 25 units, each corresponding to one hidden unit in the network.

In your trained network, you should find that the hidden units correspond roughly to detectors that look for strokes and other patterns in the input.

Figure 4: Visualization of Hidden Units.

[2] It turns out that this is equivalent to finding the input that gives the highest activation for the hidden unit, given a "norm" constraint on the input (i.e., $\|x\|_2 \leq 1$).

3.1 Optional (ungraded) exercise

In this part of the exercise, you will get to try out different learning settings for the neural network to see how the performance of the neural network varies with the regularization parameter λ and number of training steps (the MaxIter option when using fmincg).

Neural networks are very powerful models that can form highly complex decision boundaries. Without regularization, it is possible for a neural network to "overfit" a training set so that it obtains close to 100% accuracy on the training set but does not do as well on new examples that it has not seen before. You can set the regularization λ to a smaller value and the MaxIter parameter to a higher number of iterations to see this for yourself.

You will also be able to see for yourself the changes in the visualizations of the hidden units when you change the learning parameters λ and MaxIter.

You do not need to submit any solutions for this optional (ungraded) exercise.

Submission and Grading

After completing various parts of the assignment, be sure to use the submit function to submit your solutions to our servers. The following is a breakdown of how each part of this exercise is scored.

Part                                           Submitted File       Points
Feedforward and Cost Function                  nnCostFunction.m     30 points
Regularized Cost Function                      nnCostFunction.m     15 points
Sigmoid Gradient                               sigmoidGradient.m    5 points
Neural Net Gradient Function (Backpropagation) nnCostFunction.m     40 points
Regularized Gradient                           nnCostFunction.m     10 points
Total Points                                                        100 points

You are allowed to submit your solutions multiple times, and we will take only the highest score into consideration. To prevent rapid-fire guessing, the system enforces a minimum of 5 minutes between submissions.

All parts of this programming exercise are due Sunday, November 13th at 23:59:59 PDT.
\ No newline at end of file diff --git a/Neural network learning/mlclass-ex4/checkNNGradients.m b/Neural_network_learning/mlclass-ex4/checkNNGradients.m similarity index 100% rename from Neural network learning/mlclass-ex4/checkNNGradients.m rename to Neural_network_learning/mlclass-ex4/checkNNGradients.m diff --git a/Neural network learning/mlclass-ex4/computeNumericalGradient.m b/Neural_network_learning/mlclass-ex4/computeNumericalGradient.m similarity index 100% rename from Neural network learning/mlclass-ex4/computeNumericalGradient.m rename to Neural_network_learning/mlclass-ex4/computeNumericalGradient.m diff --git a/Neural network learning/mlclass-ex4/debugInitializeWeights.m b/Neural_network_learning/mlclass-ex4/debugInitializeWeights.m similarity index 100% rename from Neural network learning/mlclass-ex4/debugInitializeWeights.m rename to Neural_network_learning/mlclass-ex4/debugInitializeWeights.m diff --git a/Neural network learning/mlclass-ex4/displayData.m b/Neural_network_learning/mlclass-ex4/displayData.m similarity index 100% rename from Neural network learning/mlclass-ex4/displayData.m rename to Neural_network_learning/mlclass-ex4/displayData.m diff --git a/Neural network learning/mlclass-ex4/ex4.m b/Neural_network_learning/mlclass-ex4/ex4.m similarity index 100% rename from Neural network learning/mlclass-ex4/ex4.m rename to Neural_network_learning/mlclass-ex4/ex4.m diff --git a/Neural network learning/mlclass-ex4/ex4data1.mat b/Neural_network_learning/mlclass-ex4/ex4data1.mat similarity index 100% rename from Neural network learning/mlclass-ex4/ex4data1.mat rename to Neural_network_learning/mlclass-ex4/ex4data1.mat diff --git a/Neural network learning/mlclass-ex4/ex4weights.mat b/Neural_network_learning/mlclass-ex4/ex4weights.mat similarity index 100% rename from Neural network learning/mlclass-ex4/ex4weights.mat rename to Neural_network_learning/mlclass-ex4/ex4weights.mat diff --git a/Neural network learning/mlclass-ex4/fmincg.m b/Neural_network_learning/mlclass-ex4/fmincg.m similarity index 100% rename from Neural network learning/mlclass-ex4/fmincg.m rename to Neural_network_learning/mlclass-ex4/fmincg.m diff --git a/Neural network learning/mlclass-ex4/nnCostFunction.m b/Neural_network_learning/mlclass-ex4/nnCostFunction.m similarity index 100% rename from Neural network learning/mlclass-ex4/nnCostFunction.m rename to Neural_network_learning/mlclass-ex4/nnCostFunction.m diff --git a/Neural network learning/mlclass-ex4/predict.m b/Neural_network_learning/mlclass-ex4/predict.m similarity index 100% rename from Neural network learning/mlclass-ex4/predict.m rename to Neural_network_learning/mlclass-ex4/predict.m diff --git a/Neural network learning/mlclass-ex4/randInitializeWeights.m b/Neural_network_learning/mlclass-ex4/randInitializeWeights.m similarity index 100% rename from Neural network learning/mlclass-ex4/randInitializeWeights.m rename to Neural_network_learning/mlclass-ex4/randInitializeWeights.m diff --git a/Neural network learning/mlclass-ex4/sigmoid.m b/Neural_network_learning/mlclass-ex4/sigmoid.m similarity index 100% rename from Neural network learning/mlclass-ex4/sigmoid.m rename to Neural_network_learning/mlclass-ex4/sigmoid.m diff --git a/Neural network learning/mlclass-ex4/sigmoidGradient.m b/Neural_network_learning/mlclass-ex4/sigmoidGradient.m similarity index 100% rename from Neural network learning/mlclass-ex4/sigmoidGradient.m rename to Neural_network_learning/mlclass-ex4/sigmoidGradient.m diff --git a/Neural network
learning/mlclass-ex4/submit.m b/Neural_network_learning/mlclass-ex4/submit.m similarity index 100% rename from Neural network learning/mlclass-ex4/submit.m rename to Neural_network_learning/mlclass-ex4/submit.m diff --git a/Neural network learning/mlclass-ex4/submitWeb.m b/Neural_network_learning/mlclass-ex4/submitWeb.m similarity index 100% rename from Neural network learning/mlclass-ex4/submitWeb.m rename to Neural_network_learning/mlclass-ex4/submitWeb.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/octave-core b/Regularized linear regression and bias-variance/mlclass-ex5/octave-core deleted file mode 100644 index 0064c27..0000000 Binary files a/Regularized linear regression and bias-variance/mlclass-ex5/octave-core and /dev/null differ diff --git a/Regularized linear regression and bias-variance/ex5.pdf b/RegularizedLinearRegressionAndBiasVariance/ex5.pdf similarity index 100% rename from Regularized linear regression and bias-variance/ex5.pdf rename to RegularizedLinearRegressionAndBiasVariance/ex5.pdf diff --git a/RegularizedLinearRegressionAndBiasVariance/ex5.txt b/RegularizedLinearRegressionAndBiasVariance/ex5.txt new file mode 100644 index 0000000..5bd7aeb --- /dev/null +++ b/RegularizedLinearRegressionAndBiasVariance/ex5.txt @@ -0,0 +1,814 @@

Programming Exercise 5:
Regularized Linear Regression and Bias vs. Variance
Machine Learning
November 11, 2011

Introduction

In this exercise, you will implement regularized linear regression and use it to study models with different bias-variance properties. Before starting on the programming exercise, we strongly recommend watching the video lectures and completing the review questions for the associated topics.

To get started with the exercise, you will need to download the starter code and unzip its contents to the directory where you wish to complete the exercise. If needed, use the cd command in Octave to change to this directory before starting this exercise.

Files included in this exercise

ex5.m - Octave script that will help step you through the exercise
ex5data1.mat - Dataset
submit.m - Submission script that sends your solutions to our servers
submitWeb.m - Alternative submission script
featureNormalize.m - Feature normalization function
fmincg.m - Function minimization routine (similar to fminunc)
plotFit.m - Plot a polynomial fit
trainLinearReg.m - Trains linear regression using your cost function
[ ] linearRegCostFunction.m - Regularized linear regression cost function
[ ] learningCurve.m - Generates a learning curve
[ ] polyFeatures.m - Maps data into polynomial feature space
[ ] validationCurve.m - Generates a cross validation curve

[ ] indicates files you will need to complete

Throughout the exercise, you will be using the script ex5.m. These scripts set up the dataset for the problems and make calls to functions that you will write. You are only required to modify functions in other files, by following the instructions in this assignment.

Where to get help

We also strongly encourage using the online Q&A Forum to discuss exercises with other students. However, do not look at any source code written by others or share your source code with others.

If you run into network errors using the submit script, you can also use an online form for submitting your solutions. To use this alternative submission interface, run the submitWeb script to generate a submission file (e.g., submit ex5 part2.txt).
You can then submit this file through the web submission form in the programming exercises page (go to the programming exercises page, then select the exercise you are submitting for). If you are having no problems submitting through the standard submission system using the submit script, you do not need to use this alternative submission interface.

1 Regularized Linear Regression

In the first half of the exercise, you will implement regularized linear regression to predict the amount of water flowing out of a dam using the change of water level in a reservoir. In the next half, you will go through some diagnostics of debugging learning algorithms and examine the effects of bias vs. variance.

The provided script, ex5.m, will help you step through this exercise.

1.1 Visualizing the dataset

We will begin by visualizing the dataset containing historical records on the change in the water level, x, and the amount of water flowing out of the dam, y.

This dataset is divided into three parts:

- A training set that your model will learn on: X, y
- A cross validation set for determining the regularization parameter: Xval, yval
- A test set for evaluating performance. These are "unseen" examples which your model did not see during training: Xtest, ytest

The next step of ex5.m will plot the training data (Figure 1). In the following parts, you will implement linear regression and use that to fit a straight line to the data and plot learning curves. Following that, you will implement polynomial regression to find a better fit to the data.

Figure 1: Data (change in water level (x) vs. water flowing out of the dam (y))

1.2 Regularized linear regression cost function

Recall that regularized linear regression has the following cost function:

$$J(\theta) = \frac{1}{2m} \left( \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \right) + \frac{\lambda}{2m} \left( \sum_{j=1}^{n} \theta_j^2 \right),$$

where λ is a regularization parameter which controls the degree of regularization (thus, helping prevent overfitting). The regularization term puts a penalty on the overall cost J. As the magnitudes of the model parameters θ_j increase, the penalty increases as well. Note that you should not regularize the θ_0 term. (In Octave, the θ_0 term is represented as theta(1) since indexing in Octave starts from 1.)

You should now complete the code in the file linearRegCostFunction.m. Your task is to write a function to calculate the regularized linear regression cost function. If possible, try to vectorize your code and avoid writing loops. When you are finished, the next part of ex5.m will run your cost function using theta initialized at [1; 1]. You should expect to see an output of 303.993.

You should now submit your regularized linear regression cost function.

1.3 Regularized linear regression gradient

Correspondingly, the partial derivative of regularized linear regression's cost for θ_j is defined as

$$\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \qquad \text{for } j = 0$$

$$\frac{\partial J(\theta)}{\partial \theta_j} = \left( \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \qquad \text{for } j \geq 1$$

In linearRegCostFunction.m, add code to calculate the gradient, returning it in the variable grad. When you are finished, the next part of ex5.m will run your gradient function using theta initialized at [1; 1]. You should expect to see a gradient of [-15.30; 598.250].
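A minimal vectorized sketch of both quantities (assuming, as in the exercise, that X already contains the bias column of ones):

% Sketch: regularized linear regression cost and gradient;
% theta(1) is the bias term and is left out of the penalty
h = X * theta;
J = (1/(2*m)) * sum((h - y).^2) + (lambda/(2*m)) * sum(theta(2:end).^2);
grad = (1/m) * (X' * (h - y));
grad(2:end) = grad(2:end) + (lambda/m) * theta(2:end);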
You should now submit your regularized linear regression gradient function.

1.4 Fitting linear regression

Once your cost function and gradient are working correctly, the next part of ex5.m will run the code in trainLinearReg.m to compute the optimal values of θ. This training function uses fmincg to optimize the cost function.

In this part, we set the regularization parameter λ to zero. Because our current implementation of linear regression is trying to fit a 2-dimensional θ, regularization will not be incredibly helpful for a θ of such low dimension. In the later parts of the exercise, you will be using polynomial regression with regularization.

Finally, the ex5.m script should also plot the best fit line, resulting in an image similar to Figure 2. The best fit line tells us that the model is not a good fit to the data because the data has a non-linear pattern. While visualizing the best fit as shown is one possible way to debug your learning algorithm, it is not always easy to visualize the data and model. In the next section, you will implement a function to generate learning curves that can help you debug your learning algorithm even if it is not easy to visualize the data.

Figure 2: Linear Fit

2 Bias-variance

An important concept in machine learning is the bias-variance tradeoff. Models with high bias are not complex enough for the data and tend to underfit, while models with high variance overfit to the training data.

In this part of the exercise, you will plot training and test errors on a learning curve to diagnose bias-variance problems.

2.1 Learning curves

You will now implement code to generate the learning curves that will be useful in debugging learning algorithms. Recall that a learning curve plots training and cross validation error as a function of training set size. Your job is to fill in learningCurve.m so that it returns a vector of errors for the training set and cross validation set.

To plot the learning curve, we need a training and cross validation set error for different training set sizes. To obtain different training set sizes, you should use different subsets of the original training set X. Specifically, for a training set size of i, you should use the first i examples (i.e., X(1:i,:) and y(1:i)).

You can use the trainLinearReg function to find the θ parameters. Note that the lambda is passed as a parameter to the learningCurve function. After learning the θ parameters, you should compute the error on the training and cross validation sets. Recall that the training error for a dataset is defined as

$$J_{\text{train}}(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 \right].$$

In particular, note that the training error does not include the regularization term. One way to compute the training error is to use your existing cost function and set λ to 0 only when using it to compute the training error. When you are computing the training set error, make sure you compute it on the training subset (i.e., X(1:i,:) and y(1:i)) instead of the entire training set. However, for the cross validation error, you should compute it over the entire cross validation set. You should store the computed errors in the vectors error_train and error_val.

When you are finished, ex5.m will print the learning curves and produce a plot similar to Figure 3.
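The body of learningCurve.m can follow this shape (a sketch under the assumptions above: train on the first i examples with the given lambda, then measure both errors with lambda set to 0):

% Sketch: learning curve computation
for i = 1:m
  theta = trainLinearReg(X(1:i, :), y(1:i), lambda);             % fit on subset
  error_train(i) = linearRegCostFunction(X(1:i, :), y(1:i), theta, 0);
  error_val(i)   = linearRegCostFunction(Xval, yval, theta, 0);  % full CV set
end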
You should now submit your learning curve function.

Figure 3: Linear regression learning curve

In Figure 3, you can observe that both the train error and cross validation error are high when the number of training examples is increased. This reflects a high bias problem in the model - the linear regression model is too simple and is unable to fit our dataset well. In the next section, you will implement polynomial regression to fit a better model for this dataset.

3 Polynomial regression

The problem with our linear model was that it was too simple for the data and resulted in underfitting (high bias). In this part of the exercise, you will address this problem by adding more features.

For polynomial regression, our hypothesis has the form:

$$h_\theta(x) = \theta_0 + \theta_1 \cdot (\text{waterLevel}) + \theta_2 \cdot (\text{waterLevel})^2 + \cdots + \theta_p \cdot (\text{waterLevel})^p = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_p x_p.$$

Notice that by defining $x_1 = (\text{waterLevel}), x_2 = (\text{waterLevel})^2, \ldots, x_p = (\text{waterLevel})^p$, we obtain a linear regression model where the features are the various powers of the original value (waterLevel).

Now, you will add more features using the higher powers of the existing feature x in the dataset. Your task in this part is to complete the code in polyFeatures.m so that the function maps the original training set X of size m × 1 into its higher powers. Specifically, when a training set X of size m × 1 is passed into the function, the function should return an m × p matrix X_poly, where column 1 holds the original values of X, column 2 holds the values of X.^2, column 3 holds the values of X.^3, and so on. Note that you don't have to account for the zeroth power in this function.

Now you have a function that will map features to a higher dimension, and Part 6 of ex5.m will apply it to the training set, the test set, and the cross validation set (which you haven't used yet).
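For instance, one straightforward way to build the feature matrix (a loop keeps it readable; an equivalent bsxfun(@power, X, 1:p) one-liner would also work in Octave):

% Sketch: map an m x 1 vector X to an m x p matrix of powers
X_poly = zeros(numel(X), p);
for j = 1:p
  X_poly(:, j) = X .^ j;    % column j holds X raised to the j-th power
end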
You should now submit your polynomial feature mapping function.

3.1 Learning Polynomial Regression

After you have completed polyFeatures.m, the ex5.m script will proceed to train polynomial regression using your linear regression cost function.

Keep in mind that even though we have polynomial terms in our feature vector, we are still solving a linear regression optimization problem. The polynomial terms have simply turned into features that we can use for linear regression. We are using the same cost function and gradient that you wrote for the earlier part of this exercise.

For this part of the exercise, you will be using a polynomial of degree 8. It turns out that if we run the training directly on the projected data, it will not work well as the features would be badly scaled (e.g., an example with x = 40 will now have a feature $x_8 = 40^8 = 6.5 \times 10^{12}$). Therefore, you will need to use feature normalization.

Before learning the parameters θ for the polynomial regression, ex5.m will first call featureNormalize and normalize the features of the training set, storing the mu, sigma parameters separately. We have already implemented this function for you and it is the same function from the first exercise.

After learning the parameters θ, you should see two plots (Figures 4, 5) generated for polynomial regression with λ = 0.

From Figure 4, you should see that the polynomial fit is able to follow the datapoints very well - thus, obtaining a low training error. However, the polynomial fit is very complex and even drops off at the extremes. This is an indicator that the polynomial regression model is overfitting the training data and will not generalize well.

To better understand the problems with the unregularized (λ = 0) model, you can see that the learning curve (Figure 5) shows the same effect where the training error is low, but the cross validation error is high. There is a gap between the training and cross validation errors, indicating a high variance problem.

Figure 4: Polynomial fit, λ = 0

Figure 5: Polynomial learning curve, λ = 0

One way to combat the overfitting (high-variance) problem is to add regularization to the model. In the next section, you will get to try different λ parameters to see how regularization can lead to a better model.

3.2 Optional (ungraded) exercise: Adjusting the regularization parameter

In this section, you will get to observe how the regularization parameter affects the bias-variance of regularized polynomial regression. You should now modify the lambda parameter in ex5.m and try λ = 1 and λ = 100. For each of these values, the script should generate a polynomial fit to the data and also a learning curve.

For λ = 1, you should see a polynomial fit that follows the data trend well (Figure 6) and a learning curve (Figure 7) showing that both the cross validation and training error converge to a relatively low value. This shows that the λ = 1 regularized polynomial regression model does not have the high-bias or high-variance problems. In effect, it achieves a good trade-off between bias and variance.

For λ = 100, you should see a polynomial fit (Figure 8) that does not follow the data well. In this case, there is too much regularization and the model is unable to fit the training data.

You do not need to submit any solutions for this optional (ungraded) exercise.

Figure 6: Polynomial fit, λ = 1

3.3 Selecting λ using a cross validation set

From the previous parts of the exercise, you observed that the value of λ can significantly affect the results of regularized polynomial regression on the training and cross validation set. In particular, a model without regularization (λ = 0) fits the training set well, but does not generalize.
+After learning the parameters θ, you should see two plots (Figures 4 and
+5) generated for polynomial regression with λ = 0.
+From Figure 4, you should see that the polynomial fit is able to follow
+the datapoints very well, thus obtaining a low training error. However,
+the polynomial fit is very complex and even drops off at the extremes.
+This is an indicator that the polynomial regression model is overfitting
+the training data and will not generalize well.
+To better understand the problems with the unregularized (λ = 0) model,
+you can see that the learning curve (Figure 5) shows the same effect:
+the training error is low, but the cross validation error is high. There
+is a gap between the training and cross validation errors, indicating a
+high variance problem.
+One way to combat the overfitting (high-variance) problem is to add
+regularization to the model. In the next section, you will get to try
+different λ parameters to see how regularization can lead to a better
+model.
+
+[Figure 4: Polynomial fit, λ = 0; axes: Change in water level (x) vs.
+Water flowing out of the dam (y)]
+
+[Figure 5: Polynomial learning curve, λ = 0]
+
+3.2 Optional (ungraded) exercise: Adjusting the regularization parameter
+
+In this section, you will get to observe how the regularization
+parameter affects the bias-variance properties of regularized polynomial
+regression. You should now modify the lambda parameter in ex5.m and try
+λ = 1 and λ = 100. For each of these values, the script should generate
+a polynomial fit to the data and also a learning curve.
+For λ = 1, you should see a polynomial fit that follows the data trend
+well (Figure 6) and a learning curve (Figure 7) showing that both the
+cross validation and training error converge to a relatively low value.
+This shows that the λ = 1 regularized polynomial regression model does
+not have the high-bias or high-variance problems. In effect, it achieves
+a good trade-off between bias and variance.
+For λ = 100, you should see a polynomial fit (Figure 8) that does not
+follow the data well. In this case, there is too much regularization and
+the model is unable to fit the training data.
+You do not need to submit any solutions for this optional (ungraded)
+exercise.
+
+[Figure 6: Polynomial fit, λ = 1]
+
+[Figure 7: Polynomial learning curve, λ = 1]
+
+[Figure 8: Polynomial fit, λ = 100]
+
+3.3 Selecting λ using a cross validation set
+
+From the previous parts of the exercise, you observed that the value of
+λ can significantly affect the results of regularized polynomial
+regression on the training and cross validation set. In particular, a
+model without regularization (λ = 0) fits the training set well, but
+does not generalize. Conversely, a model with too much regularization
+(λ = 100) does not fit the training set and testing set well. A good
+choice of λ (e.g., λ = 1) can provide a good fit to the data.
+In this section, you will implement an automated method to select the λ
+parameter. Concretely, you will use a cross validation set to evaluate
+how good each λ value is. After selecting the best λ value using the
+cross validation set, we can then evaluate the model on the test set to
+estimate how well the model will perform on actual unseen data.
+Your task is to complete the code in validationCurve.m. Specifically,
+you should use the trainLinearReg function to train the model using
+different values of λ and compute the training error and cross
+validation error. You should try λ in the following range:
+{0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10}.
+
+[Figure 9: Selecting λ using a cross validation set; axes: Error vs.
+lambda; curves: Train, Cross Validation]
+
+After you have completed the code, the next part of ex5.m will run your
+function and plot a cross validation curve of error vs. λ that allows
+you to select which λ parameter to use. You should see a plot similar to
+Figure 9. In this figure, we can see that the best value of λ is around
+3. Due to randomness in the training and validation splits of the
+dataset, the cross validation error can sometimes be lower than the
+training error.
+You should now submit your validation curve function. One possible
+implementation is sketched below.
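+The sketch reuses the trainLinearReg and linearRegCostFunction functions
+from earlier in this exercise; note that both errors are evaluated with
+λ = 0, so the regularization term does not inflate them:
+
+  function [lambda_vec, error_train, error_val] = ...
+      validationCurve(X, y, Xval, yval)
+  lambda_vec = [0 0.001 0.003 0.01 0.03 0.1 0.3 1 3 10]';
+  error_train = zeros(length(lambda_vec), 1);
+  error_val   = zeros(length(lambda_vec), 1);
+  for i = 1:length(lambda_vec)
+      lambda = lambda_vec(i);
+      theta = trainLinearReg(X, y, lambda);   % fit with this lambda
+      % evaluate both errors WITHOUT the regularization term
+      error_train(i) = linearRegCostFunction(X, y, theta, 0);
+      error_val(i)   = linearRegCostFunction(Xval, yval, theta, 0);
+  end
+  end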
+3.4 Optional (ungraded) exercise: Computing test set error
+
+In the previous part of the exercise, you implemented code to compute
+the cross validation error for various values of the regularization
+parameter λ. However, to get a better indication of the model’s
+performance in the real world, it is important to evaluate the “final”
+model on a test set that was not used in any part of training (that is,
+it was neither used to select the λ parameters, nor to learn the model
+parameters θ).
+For this optional (ungraded) exercise, you should compute the test error
+using the best value of λ you found. In our cross validation, we
+obtained a test error of 3.8599 for λ = 3.
+You do not need to submit any solutions for this optional (ungraded)
+exercise.
+
+3.5 Optional (ungraded) exercise: Plotting learning curves with randomly
+selected examples
+
+In practice, especially for small training sets, when you plot learning
+curves to debug your algorithms, it is often helpful to average across
+multiple sets of randomly selected examples to determine the training
+error and cross validation error.
+Concretely, to determine the training error and cross validation error
+for i examples, you should first randomly select i examples from the
+training set and i examples from the cross validation set. You will then
+learn the parameters θ using the randomly chosen training set and
+evaluate the parameters θ on the randomly chosen training set and cross
+validation set. The above steps should then be repeated multiple times
+(say 50) and the averaged error should be used to determine the training
+error and cross validation error for i examples.
+For this optional (ungraded) exercise, you should implement the above
+strategy for computing the learning curves (one possible sketch is given
+below). For reference, Figure 10 shows the learning curve we obtained
+for polynomial regression with λ = 0.01. Your figure may differ slightly
+due to the random selection of examples.
+
+[Figure 10: Learning curve with randomly selected examples (polynomial
+regression, λ = 0.01); axes: Error vs. Number of training examples]
+
+You do not need to submit any solutions for this optional (ungraded)
+exercise.
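+A sketch of the averaging procedure for a single value of i (wrap it in
+a loop over i = 1:m to trace the whole curve); the variable names here
+are illustrative, not from the starter code:
+
+  lambda = 0.01;
+  num_trials = 50;
+  err_train_i = 0;  err_val_i = 0;
+  for t = 1:num_trials
+      % draw i random training and cross validation examples
+      rt = randperm(size(X, 1));
+      rv = randperm(size(Xval, 1));
+      Xi  = X(rt(1:i), :);      yi  = y(rt(1:i));
+      Xvi = Xval(rv(1:i), :);   yvi = yval(rv(1:i));
+      theta = trainLinearReg(Xi, yi, lambda);
+      err_train_i = err_train_i + linearRegCostFunction(Xi, yi, theta, 0);
+      err_val_i   = err_val_i   + linearRegCostFunction(Xvi, yvi, theta, 0);
+  end
+  error_train(i) = err_train_i / num_trials;   % averaged errors
+  error_val(i)   = err_val_i / num_trials;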
+Submission and Grading
+
+After completing various parts of the assignment, be sure to use the
+submit function to submit your solutions to our servers. The following
+is a breakdown of how each part of this exercise is scored.
+
+Part                                        Submitted File             Points
+Regularized Linear Regression Cost Function linearRegCostFunction.m    25 points
+Regularized Linear Regression Gradient      linearRegCostFunction.m    25 points
+Learning Curve                              learningCurve.m            20 points
+Polynomial Feature Mapping                  polyFeatures.m             10 points
+Cross Validation Curve                      validationCurve.m          20 points
+Total Points                                                          100 points
+
+You are allowed to submit your solutions multiple times, and we will
+take only the highest score into consideration. To prevent rapid-fire
+guessing, the system enforces a minimum of 5 minutes between
+submissions.
+All parts of this programming exercise are due Sunday, November 20th at
+23:59:59 PDT.
\ No newline at end of file
diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/ex5.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/ex5.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/ex5.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/ex5.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/ex5data1.mat b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/ex5data1.mat similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/ex5data1.mat rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/ex5data1.mat diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/featureNormalize.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/featureNormalize.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/featureNormalize.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/featureNormalize.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/fmincg.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/fmincg.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/fmincg.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/fmincg.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/learningCurve.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/learningCurve.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/learningCurve.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/learningCurve.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/linearRegCostFunction.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/linearRegCostFunction.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/linearRegCostFunction.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/linearRegCostFunction.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/plotFit.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/plotFit.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/plotFit.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/plotFit.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/polyFeatures.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/polyFeatures.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/polyFeatures.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/polyFeatures.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/submit.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/submit.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/submit.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/submit.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/submitWeb.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/submitWeb.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/submitWeb.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/submitWeb.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/trainLinearReg.m
b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/trainLinearReg.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/trainLinearReg.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/trainLinearReg.m diff --git a/Regularized linear regression and bias-variance/mlclass-ex5/validationCurve.m b/RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/validationCurve.m similarity index 100% rename from Regularized linear regression and bias-variance/mlclass-ex5/validationCurve.m rename to RegularizedLinearRegressionAndBiasVariance/mlclass-ex5/validationCurve.m diff --git a/Support Vector Machines/mlclass-ex6/octave-core b/Support Vector Machines/mlclass-ex6/octave-core deleted file mode 100644 index 536c65b..0000000 Binary files a/Support Vector Machines/mlclass-ex6/octave-core and /dev/null differ diff --git a/Support Vector Machines/ex6.pdf b/Support_Vector_Machines/ex6.pdf similarity index 100% rename from Support Vector Machines/ex6.pdf rename to Support_Vector_Machines/ex6.pdf diff --git a/Support_Vector_Machines/ex6.txt b/Support_Vector_Machines/ex6.txt new file mode 100644 index 0000000..79550ec --- /dev/null +++ b/Support_Vector_Machines/ex6.txt @@ -0,0 +1,803 @@
+Programming Exercise 6:
+Support Vector Machines
+Machine Learning
+November 19, 2011
+
+Introduction
+
+In this exercise, you will be using support vector machines (SVMs) to
+build a spam classifier. Before starting on the programming exercise, we
+strongly recommend watching the video lectures and completing the review
+questions for the associated topics.
+To get started with the exercise, you will need to download the starter
+code and unzip its contents to the directory where you wish to complete
+the exercise. If needed, use the cd command in Octave to change to this
+directory before starting this exercise.
+
+Files included in this exercise
+
+ex6.m - Octave script for the first half of the exercise
+ex6data1.mat - Example Dataset 1
+ex6data2.mat - Example Dataset 2
+ex6data3.mat - Example Dataset 3
+svmTrain.m - SVM training function
+svmPredict.m - SVM prediction function
+plotData.m - Plot 2D data
+visualizeBoundaryLinear.m - Plot linear boundary
+visualizeBoundary.m - Plot non-linear boundary
+linearKernel.m - Linear kernel for SVM
+[ ] gaussianKernel.m - Gaussian kernel for SVM
+[ ] dataset3Params.m - Parameters to use for Dataset 3
+ex6 spam.m - Octave script for the second half of the exercise
+spamTrain.mat - Spam training set
+spamTest.mat - Spam test set
+emailSample1.txt - Sample email 1
+emailSample2.txt - Sample email 2
+spamSample1.txt - Sample spam 1
+spamSample2.txt - Sample spam 2
+vocab.txt - Vocabulary list
+getVocabList.m - Load vocabulary list
+porterStemmer.m - Stemming function
+readFile.m - Reads a file into a character string
+submit.m - Submission script that sends your solutions to our servers
+submitWeb.m - Alternative submission script
+[ ] processEmail.m - Email preprocessing
+[ ] emailFeatures.m - Feature extraction from emails
+
+[ ] indicates files you will need to complete
+
+Throughout the exercise, you will be using the scripts ex6.m and
+ex6 spam.m. These scripts set up the dataset for the problems and make
+calls to functions that you will write. You are only required to modify
+functions in other files, by following the instructions in this
+assignment.
+
+Where to get help
+
+We also strongly encourage using the online Q&A Forum to discuss
+exercises with other students.
+However, do not look at any source code written by others or share your
+source code with others.
+If you run into network errors using the submit script, you can also use
+an online form for submitting your solutions. To use this alternative
+submission interface, run the submitWeb script to generate a submission
+file (e.g., submit ex6 part2.txt). You can then submit this file through
+the web submission form in the programming exercises page (go to the
+programming exercises page, then select the exercise you are submitting
+for). If you are having no problems submitting through the standard
+submission system using the submit script, you do not need to use this
+alternative submission interface.
+
+1 Support Vector Machines
+
+In the first half of this exercise, you will be using support vector
+machines (SVMs) with various example 2D datasets. Experimenting with
+these datasets will help you gain an intuition of how SVMs work and how
+to use a Gaussian kernel with SVMs. In the second half of the exercise,
+you will be using support vector machines to build a spam classifier.
+The provided script, ex6.m, will help you step through the first half of
+the exercise.
+
+1.1 Example Dataset 1
+
+We will begin with a 2D example dataset which can be separated by a
+linear boundary. The script ex6.m will plot the training data (Figure
+1). In this dataset, the positions of the positive examples (indicated
+with +) and the negative examples (indicated with o) suggest a natural
+separation indicated by the gap. However, notice that there is an
+outlier positive example + on the far left at about (0.1, 4.1). As part
+of this exercise, you will also see how this outlier affects the SVM
+decision boundary.
+
+[Figure 1: Example Dataset 1]
+
+In this part of the exercise, you will try using different values of the
+C parameter with SVMs. Informally, the C parameter is a positive value
+that controls the penalty for misclassified training examples. A large C
+parameter tells the SVM to try to classify all the examples correctly. C
+plays a role similar to 1/λ, where λ is the regularization parameter
+that we were using previously for logistic regression.
+
+[Figure 2: SVM Decision Boundary with C = 1 (Example Dataset 1)]
+
+[Figure 3: SVM Decision Boundary with C = 100 (Example Dataset 1)]
+
+The next part in ex6.m will run the SVM training (with C = 1) using SVM
+software that we have included with the starter code, svmTrain.m.[1]
+When C = 1, you should find that the SVM puts the decision boundary in
+the gap between the two datasets and misclassifies the data point on the
+far left (Figure 2).
+
+[1] In order to ensure compatibility with Octave, we have included this
+implementation of an SVM learning algorithm. However, this particular
+implementation was chosen to maximize compatibility, and is not very
+efficient. If you are training an SVM on a real problem, especially if
+you need to scale to a larger dataset, we strongly recommend instead
+using a highly optimized SVM toolbox such as LIBSVM.
+
+Implementation Note: Most SVM software packages (including svmTrain.m)
+automatically add the extra feature x0 = 1 for you and automatically
+take care of learning the intercept term θ0. So when passing your
+training data to the SVM software, there is no need to add this extra
+feature x0 = 1 yourself. In particular, in Octave your code should be
+working with training examples x ∈ R^n (rather than x ∈ R^(n+1)); for
+example, in the first example dataset x ∈ R^2.
+
+Your task is to try different values of C on this dataset.
+Specifically, you should change the value of C in the script to C = 100
+and run the SVM training again. When C = 100, you should find that the
+SVM now classifies every single example correctly, but has a decision
+boundary that does not appear to be a natural fit for the data (Figure
+3).
+
+1.2 SVM with Gaussian Kernels
+
+In this part of the exercise, you will be using SVMs to do non-linear
+classification. In particular, you will be using SVMs with Gaussian
+kernels on datasets that are not linearly separable.
+
+1.2.1 Gaussian Kernel
+
+To find non-linear decision boundaries with the SVM, we need to first
+implement a Gaussian kernel. You can think of the Gaussian kernel as a
+similarity function that measures the “distance” between a pair of
+examples, (x(i), x(j)). The Gaussian kernel is also parameterized by a
+bandwidth parameter, σ, which determines how fast the similarity metric
+decreases (to 0) as the examples are further apart.
+You should now complete the code in gaussianKernel.m to compute the
+Gaussian kernel between two examples, (x(i), x(j)). The Gaussian kernel
+function is defined as:
+
+  K_gaussian(x(i), x(j)) = exp( − ||x(i) − x(j)||^2 / (2σ^2) )
+                         = exp( − ( Σ_{k=1}^{n} (xk(i) − xk(j))^2 ) / (2σ^2) ).
+
+Once you’ve completed the function gaussianKernel.m, the script ex6.m
+will test your kernel function on two provided examples and you should
+expect to see a value of 0.324652.
+You should now submit your function that computes the Gaussian kernel. A
+sketch of one possible implementation follows.
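+This is a minimal sketch, not the official solution; it assumes the
+signature given in the starter file gaussianKernel.m:
+
+  function sim = gaussianKernel(x1, x2, sigma)
+  % GAUSSIANKERNEL Gaussian (RBF) similarity between examples x1 and x2
+  x1 = x1(:); x2 = x2(:);    % ensure column vectors
+  sim = exp(-sum((x1 - x2) .^ 2) / (2 * sigma ^ 2));
+  end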
+1.2.2 Example Dataset 2
+
+[Figure 4: Example Dataset 2]
+
+The next part in ex6.m will load and plot dataset 2 (Figure 4). From the
+figure, you can observe that there is no linear decision boundary that
+separates the positive and negative examples for this dataset. However,
+by using the Gaussian kernel with the SVM, you will be able to learn a
+non-linear decision boundary that can perform reasonably well for the
+dataset.
+If you have correctly implemented the Gaussian kernel function, ex6.m
+will proceed to train the SVM with the Gaussian kernel on this dataset.
+
+[Figure 5: SVM (Gaussian Kernel) Decision Boundary (Example Dataset 2)]
+
+Figure 5 shows the decision boundary found by the SVM with a Gaussian
+kernel. The decision boundary is able to separate most of the positive
+and negative examples correctly and follows the contours of the dataset
+well.
+
+1.2.3 Example Dataset 3
+
+In this part of the exercise, you will gain more practical skills on how
+to use an SVM with a Gaussian kernel. The next part of ex6.m will load
+and display a third dataset (Figure 6). You will be using the SVM with
+the Gaussian kernel with this dataset.
+In the provided dataset, ex6data3.mat, you are given the variables X, y,
+Xval, yval. The provided code in ex6.m trains the SVM classifier on the
+training set (X, y) using parameters loaded from dataset3Params.m.
+Your task is to use the cross validation set Xval, yval to determine the
+best C and σ parameters to use. You should write any additional code
+necessary to help you search over the parameters C and σ. For both C and
+σ, we suggest trying values in multiplicative steps (e.g., 0.01, 0.03,
+0.1, 0.3, 1, 3, 10, 30). Note that you should try all possible pairs of
+values for C and σ (e.g., C = 0.3 and σ = 0.1). For example, if you try
+each of the 8 values listed above for C and for σ, you would end up
+training and evaluating (on the cross validation set) a total of
+8^2 = 64 different models.
+
+[Figure 6: Example Dataset 3]
+
+[Figure 7: SVM (Gaussian Kernel) Decision Boundary (Example Dataset 3)]
+
+After you have determined the best C and σ parameters to use, you should
+modify the code in dataset3Params.m, filling in the best parameters you
+found. For our best parameters, the SVM returned the decision boundary
+shown in Figure 7.
+
+Implementation Tip: When implementing cross validation to select the
+best C and σ parameter to use, you need to evaluate the error on the
+cross validation set. Recall that for classification, the error is
+defined as the fraction of the cross validation examples that were
+classified incorrectly. In Octave, you can compute this error using
+mean(double(predictions ~= yval)), where predictions is a vector
+containing all the predictions from the SVM, and yval are the true
+labels from the cross validation set. You can use the svmPredict
+function to generate the predictions for the cross validation set.
+
+You should now submit your best C and σ values. A sketch of one possible
+search is shown below.
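+The following sketch is one way to organize the search; it calls
+svmTrain and svmPredict as provided, and the variable names best_C,
+best_sigma, and best_err are illustrative, not from the starter code:
+
+  values = [0.01 0.03 0.1 0.3 1 3 10 30];
+  best_err = Inf;
+  for C = values
+      for sigma = values
+          % train on (X, y), then measure error on the validation set
+          model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
+          predictions = svmPredict(model, Xval);
+          err = mean(double(predictions ~= yval));
+          if err < best_err
+              best_err = err;  best_C = C;  best_sigma = sigma;
+          end
+      end
+  end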
+2 Spam Classification
+
+Many email services today provide spam filters that are able to classify
+emails into spam and non-spam email with high accuracy. In this part of
+the exercise, you will use SVMs to build your own spam filter.
+You will be training a classifier to classify whether a given email, x,
+is spam (y = 1) or non-spam (y = 0). In particular, you need to convert
+each email into a feature vector x ∈ R^n. The following parts of the
+exercise will walk you through how such a feature vector can be
+constructed from an email.
+Throughout the rest of this exercise, you will be using the script
+ex6 spam.m. The dataset included for this exercise is based on a subset
+of the SpamAssassin Public Corpus.[2] For the purpose of this exercise,
+you will only be using the body of the email (excluding the email
+headers).
+
+[2] http://spamassassin.apache.org/publiccorpus/
+
+2.1 Preprocessing Emails
+
+> Anyone knows how much it costs to host a web portal ?
+>
+Well, it depends on how many visitors you're expecting. This can be
+anywhere from less than 10 bucks a month to a couple of $100. You
+should checkout http://www.rackspace.com/ or perhaps Amazon EC2 if
+you're running something big..
+To unsubscribe yourself from this mailing list, send an email to:
+groupname-unsubscribe@egroups.com
+
+Figure 8: Sample Email
+
+Before starting on a machine learning task, it is usually insightful to
+take a look at examples from the dataset. Figure 8 shows a sample email
+that contains a URL, an email address (at the end), numbers, and dollar
+amounts. While many emails would contain similar types of entities
+(e.g., numbers, other URLs, or other email addresses), the specific
+entities (e.g., the specific URL or specific dollar amount) will be
+different in almost every email. Therefore, one method often employed in
+processing emails is to “normalize” these values, so that all URLs are
+treated the same, all numbers are treated the same, etc. For example, we
+could replace each URL in the email with the unique string “httpaddr” to
+indicate that a URL was present.
+This has the effect of letting the spam classifier make a classification
+decision based on whether any URL was present, rather than whether a
+specific URL was present. This typically improves the performance of a
+spam classifier, since spammers often randomize the URLs, and thus the
+odds of seeing any particular URL again in a new piece of spam are very
+small.
+In processEmail.m, we have implemented the following email preprocessing
+and normalization steps:
+
+• Lower-casing: The entire email is converted into lower case, so that
+  capitalization is ignored (e.g., IndIcaTE is treated the same as
+  Indicate).
+• Stripping HTML: All HTML tags are removed from the emails. Many emails
+  often come with HTML formatting; we remove all the HTML tags, so that
+  only the content remains.
+• Normalizing URLs: All URLs are replaced with the text “httpaddr”.
+• Normalizing Email Addresses: All email addresses are replaced with the
+  text “emailaddr”.
+• Normalizing Numbers: All numbers are replaced with the text “number”.
+• Normalizing Dollars: All dollar signs ($) are replaced with the text
+  “dollar”.
+• Word Stemming: Words are reduced to their stemmed form. For example,
+  “discount”, “discounts”, “discounted” and “discounting” are all
+  replaced with “discount”. Sometimes, the stemmer actually strips off
+  additional characters from the end, so “include”, “includes”,
+  “included”, and “including” are all replaced with “includ”.
+• Removal of non-words: Non-words and punctuation have been removed. All
+  white spaces (tabs, newlines, spaces) have been trimmed to a single
+  space character.
+
+The result of these preprocessing steps is shown in Figure 9. While
+preprocessing has left word fragments and non-words, this form turns out
+to be much easier to work with for performing feature extraction.
+
+anyon know how much it cost to host a web portal well it depend on how
+mani visitor your expect thi can be anywher from less than number buck
+a month to a coupl of dollarnumb you should checkout httpaddr or perhap
+amazon ecnumb if your run someth big to unsubscrib yourself from thi
+mail list send an email to emailaddr
+
+Figure 9: Preprocessed Sample Email
+
+1 aa
+2 ab
+3 abil
+...
+86 anyon
+...
+916 know
+...
+1898 zero
+1899 zip
+
+Figure 10: Vocabulary List
+
+86 916 794 1077 883 370 1699 790 1822 1831 883 431 1171 794 1002 1893
+1364 592 1676 238 162 89 688 945 1663 1120 1062 1699 375 1162 479 1893
+1510 799 1182 1237 810 1895 1440 1547 181 1699 1758 1896 688 1676 992
+961 1477 71 530 1699 531
+
+Figure 11: Word Indices for Sample Email
+
+2.1.1 Vocabulary List
+
+After preprocessing the emails, we have a list of words (e.g., Figure 9)
+for each email. The next step is to choose which words we would like to
+use in our classifier and which we would want to leave out.
+For this exercise, we have chosen only the most frequently occurring
+words as our set of words considered (the vocabulary list).
+Since words that occur rarely in the training set are only in a few
+emails, they might cause the model to overfit our training set. The
+complete vocabulary list is in the file vocab.txt and also shown in
+Figure 10. Our vocabulary list was selected by choosing all words which
+occur at least 100 times in the spam corpus, resulting in a list of 1899
+words. In practice, a vocabulary list with about 10,000 to 50,000 words
+is often used.
+Given the vocabulary list, we can now map each word in the preprocessed
+emails (e.g., Figure 9) into a list of word indices that contains the
+index of the word in the vocabulary list. Figure 11 shows the mapping
+for the sample email. Specifically, in the sample email, the word
+“anyone” was first normalized to “anyon” and then mapped onto the index
+86 in the vocabulary list.
+Your task now is to complete the code in processEmail.m to perform this
+mapping (a sketch of the lookup appears at the end of Section 2.2). In
+the code, you are given a string str which is a single word from the
+processed email. You should look up the word in the vocabulary list
+vocabList and find if the word exists in the vocabulary list. If the
+word exists, you should add the index of the word into the word indices
+variable. If the word does not exist, and is therefore not in the
+vocabulary, you can skip the word.
+Once you have implemented processEmail.m, the script ex6 spam.m will run
+your code on the email sample and you should see an output similar to
+Figures 9 & 11.
+
+Octave Tip: In Octave, you can compare two strings with the strcmp
+function. For example, strcmp(str1, str2) will return 1 only when both
+strings are equal. In the provided starter code, vocabList is a
+“cell-array” containing the words in the vocabulary. In Octave, a
+cell-array is just like a normal array (i.e., a vector), except that its
+elements can also be strings (which they can’t in a normal Octave
+matrix/vector), and you index into them using curly braces instead of
+square brackets. Specifically, to get the word at index i, you can use
+vocabList{i}. You can also use length(vocabList) to get the number of
+words in the vocabulary.
+
+You should now submit the email preprocessing function.
+
+2.2 Extracting Features from Emails
+
+You will now implement the feature extraction that converts each email
+into a vector in R^n. For this exercise, you will be using
+n = # words in vocabulary list. Specifically, the feature xi ∈ {0, 1}
+for an email corresponds to whether the i-th word in the dictionary
+occurs in the email. That is, xi = 1 if the i-th word is in the email
+and xi = 0 if the i-th word is not present in the email. Thus, for a
+typical email, this feature vector would look like:
+
+      [ 0 ]
+      [ : ]
+      [ 1 ]
+  x = [ 0 ]  ∈ R^n.
+      [ : ]
+      [ 1 ]
+      [ 0 ]
+      [ : ]
+      [ 0 ]
+
+You should now complete the code in emailFeatures.m to generate a
+feature vector for an email, given the word indices.
+Once you have implemented emailFeatures.m, the next part of ex6 spam.m
+will run your code on the email sample. You should see that the feature
+vector had length 1899 and 45 non-zero entries.
+You should now submit the email feature extraction function.
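+Both functions above can be sketched briefly. First, the vocabulary
+lookup inside processEmail.m, using the strcmp and cell-array indexing
+described in the Octave Tip (word_indices is initialized earlier in the
+starter file):
+
+  for i = 1:length(vocabList)
+      if strcmp(str, vocabList{i})
+          word_indices = [word_indices; i];  % record the matching index
+          break;
+      end
+  end
+
+And one possible emailFeatures.m, which simply marks the vocabulary
+entries that occur in the email:
+
+  function x = emailFeatures(word_indices)
+  n = 1899;            % number of words in the vocabulary list
+  x = zeros(n, 1);
+  x(word_indices) = 1; % set x_i = 1 for every word index in the email
+  end
+
+Neither sketch is the official solution; both follow directly from the
+descriptions above.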
+2.3 Training SVM for Spam Classification
+
+After you have completed the feature extraction functions, the next step
+of ex6 spam.m will load a preprocessed training dataset that will be
+used to train an SVM classifier. spamTrain.mat contains 4000 training
+examples of spam and non-spam email, while spamTest.mat contains 1000
+test examples. Each original email was processed using the processEmail
+and emailFeatures functions and converted into a vector x(i) ∈ R^1899.
+After loading the dataset, ex6 spam.m will proceed to train an SVM to
+classify between spam (y = 1) and non-spam (y = 0) emails. Once the
+training completes, you should see that the classifier gets a training
+accuracy of about 99.8% and a test accuracy of about 98.5%.
+
+2.4 Top Predictors for Spam
+
+our click remov guarante visit basenumb dollar will price pleas nbsp
+most lo ga dollarnumb
+
+Figure 12: Top predictors for spam email
+
+To better understand how the spam classifier works, we can inspect the
+parameters to see which words the classifier thinks are the most
+predictive of spam. The next step of ex6 spam.m finds the parameters
+with the largest positive values in the classifier and displays the
+corresponding words (Figure 12); a sketch of this inspection is shown
+below. Thus, if an email contains words such as “guarantee”, “remove”,
+“dollar”, and “price” (the top predictors shown in Figure 12), it is
+likely to be classified as spam.
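+A sketch of how such an inspection can be written; it assumes the model
+returned by svmTrain for a linear kernel exposes its learned weights as
+model.w (if not, adjust to however svmTrain.m stores them):
+
+  [weights, idx] = sort(model.w, 'descend');  % largest weights first
+  vocabList = getVocabList();
+  for i = 1:15
+      fprintf('%-15s (%f)\n', vocabList{idx(i)}, weights(i));
+  end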
+2.5 Optional (ungraded) exercise: Try your own emails
+
+Now that you have trained a spam classifier, you can start trying it out
+on your own emails. In the starter code, we have included two email
+examples (emailSample1.txt and emailSample2.txt) and two spam examples
+(spamSample1.txt and spamSample2.txt). The last part of ex6 spam.m runs
+the spam classifier over the first spam example and classifies it using
+the learned SVM. You should now try the other examples we have provided
+and see if the classifier gets them right. You can also try your own
+emails by replacing the examples (plain text files) with your own
+emails.
+You do not need to submit any solutions for this optional (ungraded)
+exercise.
+
+2.6 Optional (ungraded) exercise: Build your own dataset
+
+In this exercise, we provided a preprocessed training set and test set.
+These datasets were created using the same functions (processEmail.m and
+emailFeatures.m) that you have now completed. For this optional
+(ungraded) exercise, you will build your own dataset using the original
+emails from the SpamAssassin Public Corpus.
+Your task in this optional (ungraded) exercise is to download the
+original files from the public corpus and extract them. After extracting
+them, you should run the processEmail[3] and emailFeatures functions on
+each email to extract a feature vector from each email. This will allow
+you to build a dataset X, y of examples. You should then randomly divide
+up the dataset into a training set, a cross validation set and a test
+set.
+While you are building your own dataset, we also encourage you to try
+building your own vocabulary list (by selecting the high frequency words
+that occur in the dataset) and adding any additional features that you
+think might be useful.
+Finally, we also suggest trying to use highly optimized SVM toolboxes
+such as LIBSVM.
+You do not need to submit any solutions for this optional (ungraded)
+exercise.
+
+[3] The original emails will have email headers that you might wish to
+leave out. We have included code in processEmail that will help you
+remove these headers.
+
+Submission and Grading
+
+After completing various parts of the assignment, be sure to use the
+submit function to submit your solutions to our servers. The following
+is a breakdown of how each part of this exercise is scored.
+
+Part                               Submitted File        Points
+Gaussian Kernel                    gaussianKernel.m      25 points
+Parameters (C, σ) for Dataset 3    dataset3Params.m      25 points
+Email Preprocessing                processEmail.m        25 points
+Email Feature Extraction           emailFeatures.m       25 points
+Total Points                                             100 points
+
+You are allowed to submit your solutions multiple times, and we will
+take only the highest score into consideration. To prevent rapid-fire
+guessing, the system enforces a minimum of 5 minutes between
+submissions.
+All parts of this programming exercise are due Sunday, November 27th at
+23:59:59 PDT.
\ No newline at end of file
diff --git a/Support Vector Machines/mlclass-ex6/dataset3Params.m b/Support_Vector_Machines/mlclass-ex6/dataset3Params.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/dataset3Params.m rename to Support_Vector_Machines/mlclass-ex6/dataset3Params.m diff --git a/Support Vector Machines/mlclass-ex6/emailFeatures.m b/Support_Vector_Machines/mlclass-ex6/emailFeatures.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/emailFeatures.m rename to Support_Vector_Machines/mlclass-ex6/emailFeatures.m diff --git a/Support Vector Machines/mlclass-ex6/emailSample1.txt b/Support_Vector_Machines/mlclass-ex6/emailSample1.txt similarity index 100% rename from Support Vector Machines/mlclass-ex6/emailSample1.txt rename to Support_Vector_Machines/mlclass-ex6/emailSample1.txt diff --git a/Support Vector Machines/mlclass-ex6/emailSample2.txt b/Support_Vector_Machines/mlclass-ex6/emailSample2.txt similarity index 100% rename from Support Vector Machines/mlclass-ex6/emailSample2.txt rename to Support_Vector_Machines/mlclass-ex6/emailSample2.txt diff --git a/Support Vector Machines/mlclass-ex6/ex6.m b/Support_Vector_Machines/mlclass-ex6/ex6.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/ex6.m rename to Support_Vector_Machines/mlclass-ex6/ex6.m diff --git a/Support Vector Machines/mlclass-ex6/ex6_spam.m b/Support_Vector_Machines/mlclass-ex6/ex6_spam.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/ex6_spam.m rename to Support_Vector_Machines/mlclass-ex6/ex6_spam.m diff --git a/Support Vector Machines/mlclass-ex6/ex6data1.mat b/Support_Vector_Machines/mlclass-ex6/ex6data1.mat similarity index 100% rename from Support Vector Machines/mlclass-ex6/ex6data1.mat rename to Support_Vector_Machines/mlclass-ex6/ex6data1.mat diff --git a/Support Vector Machines/mlclass-ex6/ex6data2.mat b/Support_Vector_Machines/mlclass-ex6/ex6data2.mat similarity index 100% rename from Support Vector Machines/mlclass-ex6/ex6data2.mat rename to Support_Vector_Machines/mlclass-ex6/ex6data2.mat diff --git a/Support Vector Machines/mlclass-ex6/ex6data3.mat b/Support_Vector_Machines/mlclass-ex6/ex6data3.mat similarity index 100% rename from Support Vector Machines/mlclass-ex6/ex6data3.mat rename to Support_Vector_Machines/mlclass-ex6/ex6data3.mat diff --git a/Support Vector Machines/mlclass-ex6/gaussianKernel.m b/Support_Vector_Machines/mlclass-ex6/gaussianKernel.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/gaussianKernel.m rename to Support_Vector_Machines/mlclass-ex6/gaussianKernel.m diff --git a/Support Vector Machines/mlclass-ex6/getVocabList.m b/Support_Vector_Machines/mlclass-ex6/getVocabList.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/getVocabList.m rename to Support_Vector_Machines/mlclass-ex6/getVocabList.m diff --git a/Support Vector Machines/mlclass-ex6/linearKernel.m
b/Support_Vector_Machines/mlclass-ex6/linearKernel.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/linearKernel.m rename to Support_Vector_Machines/mlclass-ex6/linearKernel.m diff --git a/Support Vector Machines/mlclass-ex6/plotData.m b/Support_Vector_Machines/mlclass-ex6/plotData.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/plotData.m rename to Support_Vector_Machines/mlclass-ex6/plotData.m diff --git a/Support Vector Machines/mlclass-ex6/porterStemmer.m b/Support_Vector_Machines/mlclass-ex6/porterStemmer.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/porterStemmer.m rename to Support_Vector_Machines/mlclass-ex6/porterStemmer.m diff --git a/Support Vector Machines/mlclass-ex6/processEmail.m b/Support_Vector_Machines/mlclass-ex6/processEmail.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/processEmail.m rename to Support_Vector_Machines/mlclass-ex6/processEmail.m diff --git a/Support Vector Machines/mlclass-ex6/readFile.m b/Support_Vector_Machines/mlclass-ex6/readFile.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/readFile.m rename to Support_Vector_Machines/mlclass-ex6/readFile.m diff --git a/Support Vector Machines/mlclass-ex6/spamSample1.txt b/Support_Vector_Machines/mlclass-ex6/spamSample1.txt similarity index 100% rename from Support Vector Machines/mlclass-ex6/spamSample1.txt rename to Support_Vector_Machines/mlclass-ex6/spamSample1.txt diff --git a/Support Vector Machines/mlclass-ex6/spamSample2.txt b/Support_Vector_Machines/mlclass-ex6/spamSample2.txt similarity index 100% rename from Support Vector Machines/mlclass-ex6/spamSample2.txt rename to Support_Vector_Machines/mlclass-ex6/spamSample2.txt diff --git a/Support Vector Machines/mlclass-ex6/spamTest.mat b/Support_Vector_Machines/mlclass-ex6/spamTest.mat similarity index 100% rename from Support Vector Machines/mlclass-ex6/spamTest.mat rename to Support_Vector_Machines/mlclass-ex6/spamTest.mat diff --git a/Support Vector Machines/mlclass-ex6/spamTrain.mat b/Support_Vector_Machines/mlclass-ex6/spamTrain.mat similarity index 100% rename from Support Vector Machines/mlclass-ex6/spamTrain.mat rename to Support_Vector_Machines/mlclass-ex6/spamTrain.mat diff --git a/Support Vector Machines/mlclass-ex6/submit.m b/Support_Vector_Machines/mlclass-ex6/submit.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/submit.m rename to Support_Vector_Machines/mlclass-ex6/submit.m diff --git a/Support Vector Machines/mlclass-ex6/submitWeb.m b/Support_Vector_Machines/mlclass-ex6/submitWeb.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/submitWeb.m rename to Support_Vector_Machines/mlclass-ex6/submitWeb.m diff --git a/Support Vector Machines/mlclass-ex6/svmPredict.m b/Support_Vector_Machines/mlclass-ex6/svmPredict.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/svmPredict.m rename to Support_Vector_Machines/mlclass-ex6/svmPredict.m diff --git a/Support Vector Machines/mlclass-ex6/svmTrain.m b/Support_Vector_Machines/mlclass-ex6/svmTrain.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/svmTrain.m rename to Support_Vector_Machines/mlclass-ex6/svmTrain.m diff --git a/Support Vector Machines/mlclass-ex6/visualizeBoundary.m b/Support_Vector_Machines/mlclass-ex6/visualizeBoundary.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/visualizeBoundary.m rename to Support_Vector_Machines/mlclass-ex6/visualizeBoundary.m diff 
--git a/Support Vector Machines/mlclass-ex6/visualizeBoundaryLinear.m b/Support_Vector_Machines/mlclass-ex6/visualizeBoundaryLinear.m similarity index 100% rename from Support Vector Machines/mlclass-ex6/visualizeBoundaryLinear.m rename to Support_Vector_Machines/mlclass-ex6/visualizeBoundaryLinear.m diff --git a/Support Vector Machines/mlclass-ex6/vocab.txt b/Support_Vector_Machines/mlclass-ex6/vocab.txt similarity index 100% rename from Support Vector Machines/mlclass-ex6/vocab.txt rename to Support_Vector_Machines/mlclass-ex6/vocab.txt