diff --git a/Assignment 4.Rmd b/Assignment 4.Rmd
index 54b0e66..27c2918 100644
--- a/Assignment 4.Rmd
+++ b/Assignment 4.Rmd
@@ -8,13 +8,19 @@ https://www.cs.uic.edu/~wilkinson/Applets/cluster.html
 
 ```{r}
-library()
+library(igraph)
+library(dplyr)
+library(tidyr)
+library(tidyverse)
+library(ggplot2)
 ```
 
 Now, upload the file "Class_Motivation.csv" from the Assignment 4 Repository as a data frame called "K1"
 
 ```{r}
-K1 <- read.csv(...)
+K1 <- read.csv("Class_Motivation.csv", header=T)
+K1b<-gather(K1,week, measure,2:6)
+plot(as.factor(K1b$week),K1b$measure)
 ```
 
@@ -26,12 +32,15 @@ The algorithm will treat each row as a value belonging to a person, so we need t
 
 ```{r}
-K2 <- 
+K2 <- select(K1, 2:6)
 ```
 
 It is important to think about the meaning of missing values when clustering. We could treat them as having meaning or we could remove those people who have them. Neither option is ideal. What problems do you foresee if we recode or remove these values? Write your answers below:
 
+```{r}
+# Recoding missing values will change the results, while removing them can bias the sample. Either way of modifying the original data reduces how well the sample represents the class.
+```
 
 We will remove people with missing values for this assignment, but keep in mind the issues that you have identified.
 
@@ -41,13 +50,17 @@ We will remove people with missing values for this assignment, but keep in mind
 K3 <- na.omit(K2) #This command creates a data frame with only those people with no missing values. It "omits" all rows with missing values, also known as a "listwise deletion". E.g., it runs down the list deleting rows as it goes.
 
+# or make missing value = 0
+# K3<-K2
+# K3[is.na(K3)]<-0
+
 ```
 
 Another pre-processing step used in K-means is to standardize the values so that they have the same range. We do this because we want to treat each week as equally important - if we do not standardise then the week with the largest range will have the greatest impact on which clusters are formed. We standardise the values by using the "scale()" command.
 
 ```{r}
-K3 <- 
+K3 <- scale(K3)
 ```
 
@@ -66,20 +79,38 @@ Also, we need to choose the number of clusters we think are in the data. We will
 
 ```{r}
-fit <- 
+fit1a <- kmeans(K3,2)
+fit1b <- kmeans(K3,2)
+fit1c <- kmeans(K3,2)
+
 #We have created objects called "fit1a", "fit1b", and "fit1c" that contain all the details of our clustering including which observations belong to each cluster.
 
 #We can access the list of clusters by typing "fit1a$cluster"; the top row corresponds to the original order the rows were in. Notice we have deleted some rows.
-
+fit1a$cluster
 
 #We can also attach these clusters to the original dataframe by using the "data.frame" command to create a new data frame called K4.
 
-K4
+K4<-data.frame(K3, fit1a$cluster,fit1b$cluster,fit1c$cluster)
 
+fit1a$withinss
+fit1b$withinss
+fit1c$withinss
+
+fit1a$tot.withinss
+fit1b$tot.withinss
+fit1c$tot.withinss
+
+fit1a$betweenss
+fit1b$betweenss
+fit1c$betweenss
+
+K4<-data.frame(K3, fit1c$cluster)
 
-#Have a look at the K4 dataframe. Lets change the names of the variables to make it more convenient with the names() command.
+#Have a look at the K4 dataframe. Let's change the names of the variables to make them more convenient with the names() command.
+names(K4)<-c("1","2","3","4","5","cluster") # c() stands for concatenate; it creates a vector of anything, in this case a vector of names.
 ```
 
 Now we need to visualize the clusters we have created. To do so we want to play with the structure of our data. What would be most useful would be if we could visualize average motivation by cluster, by week. To do this we will need to convert our data from wide to long format. Remember your old friends tidyr and dplyr!
 
 First let's use tidyr to convert from wide to long format.
 ```{r}
-K5 <- gather(K4, "week", "motivation", 1:5)
+K5 <- tidyr::gather(K4, "week","motivation",1:5)
+
 ```
 
 Now let's use dplyr to average our motivation values by week and by cluster.
 
 ```{r}
-K6 <- K5 %>% group_by(week, cluster) %>% summarise(K6, avg = mean(motivation))
+K6 <- K5 %>% group_by(week, cluster) %>% summarise(avg = mean(motivation))
 ```
 
@@ -113,9 +145,9 @@ Likewise, since "cluster" is not numeric but rather a categorical label we want
 
 ```{r}
-K6$week <- 
+K6$week <- as.numeric(K6$week)
 
-K6$cluster <- 
+K6$cluster <- as.factor(K6$cluster)
 ```
 
@@ -128,32 +160,108 @@ Now we can plot our line plot using the ggplot command, "ggplot()".
 
 ```{r}
-ggplot(K6, aes(week, avg, colour = cluster)) + geom_line() + xlab("Week") + ylab("Average Motivation")
+ggplot(K6, aes(week, avg, colour = cluster)) + geom_line() + xlab("Week") + ylab("Average Motivation")
 ```
 
 What patterns do you see in the plot?
 
+```{r}
+# The two clusters start with different motivation levels. Cluster 1 starts high, drops steeply, rises steeply back to its starting level, and then drops at the same rate to a negative motivation score. Cluster 2 starts with negative motivation, rises more slowly than cluster 1, drops in the same week that cluster 1 drops, and then rises again in the same week that cluster 1 rises, ending higher than cluster 1.
+
+# Both clusters fluctuate at the same times, at roughly constant rates, but in opposite directions. Even though cluster 2 starts with negative motivation and changes more slowly, it ends with higher motivation than cluster 1.
+```
 
 It would be useful to determine how many people are in each cluster. We can do this easily with dplyr.
 
 ```{r}
-K7 <- count(K4, cluster)
+K7 <- dplyr::count(K4, cluster)
 ```
 
 Look at the number of people in each cluster, then repeat this process for 3 rather than 2 clusters. Which cluster grouping do you think is more informative? Write your answer below:
 
+```{r}
+# The 3-cluster grouping is more informative than the 2-cluster grouping. The three clusters correspond to three average levels of motivation: high, medium, and low. The lines that cross each other and fluctuate show that motivation levels do not remain constant.
+```
+
+
 ##Part II
 
 Using the data collected in the HUDK4050 entrance survey (HUDK4050-cluster.csv), use K-means to cluster the students first according to location (lat/long) and then according to their answers to the questions; each student should belong to two clusters.
+```{r}
+library(tidyverse)
+library(dplyr)
+
+M1 <- read.csv("HUDK4050-cluster.csv", header = TRUE)
+#create a dataframe that only includes the survey questions about hours
+M2<-select(M1,4:9)
+#dealing with missing values - see the discussion in Part I
+
+# Generate clusters for survey questions
+fit2a<-kmeans(M2,1)
+fit2b<-kmeans(M2,2)
+fit2c<-kmeans(M2,3)
+fit2d<-kmeans(M2,4)
+fit2e<-kmeans(M2,5)
+fit2f<-kmeans(M2,6)
+fit2g<-kmeans(M2,7)
+
+mss<-c(fit2a$tot.withinss,fit2b$tot.withinss,fit2c$tot.withinss,fit2d$tot.withinss,fit2e$tot.withinss,fit2f$tot.withinss,fit2g$tot.withinss,fit2a$betweenss,fit2b$betweenss,fit2c$betweenss,fit2d$betweenss,fit2e$betweenss,fit2f$betweenss,fit2g$betweenss)
+
+clusters<-c(seq(1,7,1),seq(1,7,1))
+col<-c(rep("blue",7),rep("red",7))
+
+plot(clusters,mss,col=col)
+
+#create a dataframe that only includes location data
+L1<-select(M1,2:3)
+#L1<-unite(L1,place,Q1_1,Q1_2,sep=",")
+
+#request latitude and longitude from the Google Maps API
+#library(ggmap)
+#L2<-geocode(as.character(L1$place),output="latlon",source="dsk")
+
+#generate clusters for lat/lon
+plot(L1$long,L1$lat)
+
+fit3a<-kmeans(L1,2)
+fit3b<-kmeans(L1,2)
+fit3c<-kmeans(L1,2)
+
+fit3a$tot.withinss
+fit3b$tot.withinss
+fit3c$tot.withinss
+
+#combine everything together
+ML<-data.frame(M1$compare.features,M1$math.accuracy,M1$planner.use,M1$enjoy.discuss,M1$enjoy.group,M1$meet.deadline,fit2c$cluster,M1$lat,M1$long,fit3a$cluster)
+
+pairs(ML)
+
+
+```
+
+
 ##Part III
 
 Create a visualization that shows the overlap between the two clusters each student belongs to in Part II. IE - Are there geographical patterns that correspond to the answers?
 
 ```{r}
+# There are lots of ways to answer this. A common way was to do a scatterplot of students, coloring the points with one set of clusters and using shapes for the other set of clusters. I think a better way is to use a mosaic plot, which can be generated either through ggplot or with a specific package called vcd.
+
+table(ML$fit2c.cluster,ML$fit3a.cluster)
+ML2 <- ML %>% group_by(fit2c.cluster,fit3a.cluster) %>% summarize(count = n())
+ggplot(ML2, aes(x = fit2c.cluster, y = fit3a.cluster, size = count)) + geom_point()
+
+library(vcd)
+P1 <- structable(fit2c$cluster ~ fit3a$cluster)
+mosaic(P1, shade=TRUE, legend=TRUE)
+
+# This shows how much overlap there is between the two sets of clusters.
+
+#There is a large cluster of people who are geographically close, and within that large cluster there are three smaller clusters of people who are even closer together. Most people in the large cluster have high scores.
 ```
diff --git a/Assignment-4.html b/Assignment-4.html
new file mode 100644
index 0000000..2a46636
--- /dev/null
+++ b/Assignment-4.html
@@ -0,0 +1,630 @@
+Assignment 4: K Means Clustering

In this assignment we will be applying the K-means clustering algorithm we looked at in class. At the following link you can find a description of K-means:

+

https://www.cs.uic.edu/~wilkinson/Applets/cluster.html

+
library(igraph)
+
## 
+## Attaching package: 'igraph'
+
## The following objects are masked from 'package:stats':
+## 
+##     decompose, spectrum
+
## The following object is masked from 'package:base':
+## 
+##     union
+
library(dplyr)
+
## 
+## Attaching package: 'dplyr'
+
## The following objects are masked from 'package:igraph':
+## 
+##     as_data_frame, groups, union
+
## The following objects are masked from 'package:stats':
+## 
+##     filter, lag
+
## The following objects are masked from 'package:base':
+## 
+##     intersect, setdiff, setequal, union
+
library(tidyr)
+
## 
+## Attaching package: 'tidyr'
+
## The following object is masked from 'package:igraph':
+## 
+##     crossing
+
library(tidyverse)
+
## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
+
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
+## ✓ tibble  3.0.3     ✓ stringr 1.4.0
+## ✓ readr   1.3.1     ✓ forcats 0.5.0
+
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
+## x tibble::as_data_frame() masks dplyr::as_data_frame(), igraph::as_data_frame()
+## x purrr::compose()        masks igraph::compose()
+## x tidyr::crossing()       masks igraph::crossing()
+## x dplyr::filter()         masks stats::filter()
+## x dplyr::groups()         masks igraph::groups()
+## x dplyr::lag()            masks stats::lag()
+## x purrr::simplify()       masks igraph::simplify()
+
library(ggplot2)
+

Now, upload the file “Class_Motivation.csv” from the Assignment 4 Repository as a data frame called “K1”

+
K1 <- read.csv("Class_Motivation.csv", header=T)
+K1b<-gather(K1,week, measure,2:6)
+plot(as.factor(K1b$week),K1b$measure)
+

+

This file contains the self-reported motivation scores for a class over five weeks. We are going to look for patterns in motivation over this time and sort people into clusters based on those patterns.

+

But before we do that, we will need to manipulate the data frame into a structure that can be analyzed by our clustering algorithm.

+

The algorithm will treat each row as a value belonging to a person, so we need to remove the id variable.

+
K2 <- select(K1, 2:6)
+

It is important to think about the meaning of missing values when clustering. We could treat them as having meaning or we could remove those people who have them. Neither option is ideal. What problems do you foresee if we recode or remove these values? Write your answers below:

+
# Recoding missing values will change the results, while removing them can bias the sample. Either way of modifying the original data reduces how well the sample represents the class.
+

We will remove people with missing values for this assignment, but keep in mind the issues that you have identified.

+
K3 <- na.omit(K2) #This command creates a data frame with only those people with no missing values. It "omits" all rows with missing values, also known as a "listwise deletion". E.g., it runs down the list deleting rows as it goes.
+
+# or make missing value = 0
+# K3<-K2
+# K3[is.na(K3)]<-0
+
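A third option, sketched here only for comparison (K3m is a hypothetical name, not part of the assignment), is mean imputation: it keeps every row but shrinks the variance of the imputed weeks.

+# Replace each NA with its column's mean instead of deleting the row.
+K3m <- K2
+for (j in seq_along(K3m)) {
+  K3m[is.na(K3m[, j]), j] <- mean(K3m[, j], na.rm = TRUE)
+}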

Another pre-processing step used in K-means is to standardize the values so that they have the same range. We do this because we want to treat each week as equally important - if we do not standardise then the week with the largest range will have the greatest impact on which clusters are formed. We standardise the values by using the “scale()” command.

+
K3 <- scale(K3)
+
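With its defaults (center = TRUE, scale = TRUE), scale() subtracts each column's mean and divides by its standard deviation. A hand-rolled equivalent, as a sketch (zK3 is a hypothetical name):

+# Same z-scores computed by hand; should match scale(na.omit(K2)) up to attributes.
+zK3 <- apply(na.omit(K2), 2, function(x) (x - mean(x)) / sd(x))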

Now we will run the K-means clustering algorithm we talked about in class. 1) The algorithm starts by randomly choosing some starting values 2) Associates all observations near to those values with them 3) Calculates the mean of those clusters of values 4) Selects the observation closest to the mean of the cluster 5) Re-associates all observations closest to this observation 6) Continues this process until the clusters are no longer changing

+
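For intuition, here is a minimal sketch of that loop (naive_kmeans is a hypothetical helper, assuming Euclidean distance and no empty clusters; the built-in kmeans() below is what we actually use). One difference from step 4 above: standard K-means uses the cluster mean itself as the new centre rather than the observation closest to it.

+# 1) pick k random observations as starting centres
+# 2) assign every observation to its nearest centre
+# 3) recompute each centre as the mean of its cluster, then repeat
+naive_kmeans <- function(X, k, iters = 10) {
+  X <- as.matrix(X)
+  centres <- X[sample(nrow(X), k), , drop = FALSE]
+  for (i in seq_len(iters)) {
+    d <- as.matrix(dist(rbind(centres, X)))[-(1:k), 1:k]  # obs-to-centre distances
+    cl <- max.col(-d)                                     # index of nearest centre
+    centres <- apply(X, 2, function(col) tapply(col, cl, mean))
+  }
+  list(cluster = cl, centers = centres)
+}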

Notice that in this case we have 5 variables and in class we only had 2. It is impossible to visualize this process with 5 variables.

+

Also, we need to choose the number of clusters we think are in the data. We will start with 2.

+
fit1a <- kmeans(K3,2)
+fit1b <- kmeans(K3,2)
+fit1c <- kmeans(K3,2)
+
+
+#We have created objects called "fit1a", "fit1b", and "fit1c" that contain all the details of our clustering including which observations belong to each cluster.
+
+#We can access the list of clusters by typing "fit1a$cluster"; the top row corresponds to the original order the rows were in. Notice we have deleted some rows.
+
+fit1a$cluster
+
##  1  3  7 10 12 14 15 17 18 19 20 22 25 26 27 28 29 30 31 32 33 34 35 
+##  2  1  2  2  2  2  1  1  1  1  2  2  1  1  1  1  1  1  1  1  1  1  2
+
#We can also attach these clusters to the original dataframe by using the "data.frame" command to create a new data frame called K4.
+
+K4<-data.frame(K3, fit1a$cluster,fit1b$cluster,fit1c$cluster)
+
+fit1a$withinss
+
## [1] 38.34837 25.91390
+
fit1b$withinss
+
## [1] 54.95985 11.01369
+
fit1c$withinss
+
## [1] 38.34837 25.91390
+
fit1a$tot.withinss
+
## [1] 64.26227
+
fit1b$tot.withinss
+
## [1] 65.97354
+
fit1c$tot.withinss
+
## [1] 64.26227
+
fit1a$betweenss
+
## [1] 45.73773
+
fit1b$betweenss
+
## [1] 44.02646
+
fit1c$betweenss
+
## [1] 45.73773
+
K4<-data.frame(K3, fit1c$cluster)
+
+
+#Have a look at the K4 dataframe. Let's change the names of the variables to make them more convenient with the names() command.
+names(K4)<-c("1","2","3","4","5","cluster") # c() stands for concatenate; it creates a vector of anything, in this case a vector of names.
+
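Because kmeans() starts from random centres, the three runs above can land in different local optima, which is why fit1b's totals differ from fit1a's and fit1c's. A common remedy, sketched here (fit_best is a hypothetical name), is the nstart argument, which tries several random starts and keeps the best:

+# nstart = 25 keeps the run with the lowest tot.withinss; set.seed() makes it reproducible.
+set.seed(123)
+fit_best <- kmeans(K3, centers = 2, nstart = 25)
+fit_best$tot.withinss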

Now we need to visualize the clusters we have created. To do so we want to play with the structure of our data. What would be most useful would be if we could visualize average motivation by cluster, by week. To do this we will need to convert our data from wide to long format. Remember your old friends tidyr and dplyr!

+

First let's use tidyr to convert from wide to long format.

+
K5 <- tidyr::gather(K4, "week","motivation",1:5)
+
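gather() still works but has been superseded in current tidyr; a sketch of the equivalent call with pivot_longer() (the result's column order may differ):

+# Modern tidyr replacement for the gather() call above.
+K5 <- tidyr::pivot_longer(K4, cols = 1:5, names_to = "week", values_to = "motivation")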

Now let's use dplyr to average our motivation values by week and by cluster.

+
K6 <- K5 %>% group_by(week, cluster) %>% summarise(avg = mean(motivation))
+
## `summarise()` regrouping output by 'week' (override with `.groups` argument)
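That message just reports that the result is still grouped by week. Assuming dplyr >= 1.0, it can be silenced by setting .groups explicitly:

+# .groups = "drop" returns an ungrouped data frame and suppresses the message.
+K6 <- K5 %>% group_by(week, cluster) %>% summarise(avg = mean(motivation), .groups = "drop")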
+

Now it’s time to do some visualization:

+

https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html

+

And you can see the range of available graphics in ggplot here:

+

http://ggplot2.tidyverse.org/reference/index.html

+

We are going to create a line plot similar to the one created in this paper about school dropout (Bowers, 2010). It will have motivation on the Y-axis and weeks on the X-axis. To do this we will want our week variable to be treated as a number, but because it was created from variable names it is currently being treated as a character variable. You can see this if you click on the arrow on the left of K6 in the Data pane. Week is designated by “chr”. To convert it to numeric, we use the as.numeric command.

+

Likewise, since “cluster” is not numeric but rather a categorical label we want to convert it from an “integer” format to a “factor” format so that ggplot does not treat it as a number. We can do this with the as.factor() command.

+
K6$week <- as.numeric(K6$week)
+
+K6$cluster <- as.factor(K6$cluster)
+

Now we can plot our line plot using the ggplot command, “ggplot()”.

+ +
ggplot(K6, aes(week, avg, colour = cluster)) + geom_line() +  xlab("Week") + ylab("Average Motivation")
+

+

What patterns do you see in the plot?

+
# The two clusters start with different motivation levels. Cluster 1 starts high, drops steeply, rises steeply back to its starting level, and then drops at the same rate to a negative motivation score. Cluster 2 starts with negative motivation, rises more slowly than cluster 1, drops in the same week that cluster 1 drops, and then rises again in the same week that cluster 1 rises, ending higher than cluster 1.
+
+# Both clusters fluctuate at the same times, at roughly constant rates, but in opposite directions. Even though cluster 2 starts with negative motivation and changes more slowly, it ends with higher motivation than cluster 1.
+

It would be useful to determine how many people are in each cluster. We can do this easily with dplyr.

+
K7 <- dplyr::count(K4, cluster)
+

Look at the number of people in each cluster, then repeat this process for 3 rather than 2 clusters. Which cluster grouping do you think is more informative? Write your answer below:

+
# The 3-cluster grouping is more informative than the 2-cluster grouping. The three clusters correspond to three average levels of motivation: high, medium, and low. The lines that cross each other and fluctuate show that motivation levels do not remain constant.
+
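For reference, a sketch of that repeat with 3 clusters (fit_k3 and K4c are hypothetical names, reusing the scaled K3 from above):

+# Same pipeline with k = 3: fit, attach the cluster labels, count members.
+fit_k3 <- kmeans(K3, 3)
+K4c <- data.frame(K3, cluster = fit_k3$cluster)
+dplyr::count(K4c, cluster)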

##Part II

+

Using the data collected in the HUDK4050 entrance survey (HUDK4050-cluster.csv), use K-means to cluster the students first according to location (lat/long) and then according to their answers to the questions; each student should belong to two clusters.

+
library(tidyverse)
+library(dplyr)
+
+M1 <- read.csv("HUDK4050-cluster.csv", header = TRUE)
+#create a dataframe that only includes the survey questions about hours
+M2<-select(M1,4:9)
+#dealing with missing values - see the discussion in Part I
+
+# Generate clusters for survey questions
+fit2a<-kmeans(M2,1)
+fit2b<-kmeans(M2,2)
+fit2c<-kmeans(M2,3)
+fit2d<-kmeans(M2,4)
+fit2e<-kmeans(M2,5)
+fit2f<-kmeans(M2,6)
+fit2g<-kmeans(M2,7)
+
+mss<-c(fit2a$tot.withinss,fit2b$tot.withinss,fit2c$tot.withinss,fit2d$tot.withinss,fit2e$tot.withinss,fit2f$tot.withinss,fit2g$tot.withinss,fit2a$betweenss,fit2b$betweenss,fit2c$betweenss,fit2d$betweenss,fit2e$betweenss,fit2f$betweenss,fit2g$betweenss)
+
+clusters<-c(seq(1,7,1),seq(1,7,1))
+col<-c(rep("blue",7),rep("red",7))
+
+plot(clusters,mss,col=col)
+
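One way to read this plot: for any k, tot.withinss (blue) plus betweenss (red) equals the same total sum of squares, so the two curves are mirror images; a bend ("elbow") in the blue curve suggests a reasonable cluster count, and the analysis below goes on to use k = 3 (fit2c). A quick check of that identity:

+# within + between = total, and totss is the same for every k.
+fit2c$tot.withinss + fit2c$betweenss
+fit2c$totss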

+
#create a dataframe that only includes location data
+L1<-select(M1,2:3)
+#L1<-unite(L1,place,Q1_1,Q1_2,sep=",")
+
+#request latitude and longitude from the Google Maps API
+#library(ggmap)
+#L2<-geocode(as.character(L1$place),output="latlon",source="dsk")
+
+#generate clusters for lat/lon
+plot(L1$long,L1$lat)
+

+
fit3a<-kmeans(L1,2)
+fit3b<-kmeans(L1,2)
+fit3c<-kmeans(L1,2)
+
+fit3a$tot.withinss
+
## [1] 23589.49
+
fit3b$tot.withinss
+
## [1] 23589.49
+
fit3c$tot.withinss
+
## [1] 23589.49
+
#combine everything together
+ML<-data.frame(M1$compare.features,M1$math.accuracy,M1$planner.use,M1$enjoy.discuss,M1$enjoy.group,M1$meet.deadline,fit2c$cluster,M1$lat,M1$long,fit3a$cluster)
+
+pairs(ML)
+

+

##Part III

+

Create a visualization that shows the overlap between the two clusters each student belongs to in Part II. IE - Are there geographical patterns that correspond to the answers?

+
# There are lots of ways to answer this. A common way was to do a scatterplot of students, coloring the points with one set of clusters and using shapes for the other set of clusters (sketched below). I think a better way is to use a mosaic plot, which can be generated either through ggplot or with a specific package called vcd.
+
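A sketch of that scatterplot option, using the column names that data.frame() created in ML above:

+# Location on the axes; colour = answer cluster (fit2c), shape = location cluster (fit3a).
+ggplot(ML, aes(x = M1.long, y = M1.lat, colour = factor(fit2c.cluster), shape = factor(fit3a.cluster))) + geom_point(size = 3)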
+table(ML$fit2c.cluster,ML$fit3a.cluster)
+
##    
+##      1  2
+##   1 17 18
+##   2 11 14
+##   3  8 16
+
ML2 <- ML %>% group_by(fit2c.cluster,fit3a.cluster) %>% summarize(count = n())
+
## `summarise()` regrouping output by 'fit2c.cluster' (override with `.groups` argument)
+
ggplot(ML2, aes(x = fit2c.cluster, y = fit3a.cluster, size = count)) + geom_point()
+

+
library(vcd)
+
## Loading required package: grid
+
P1 <- structable(fit2c$cluster ~ fit3a$cluster)
+mosaic(P1, shade=TRUE, legend=TRUE)
+

+
# This shows how much overlap there is between the two sets of clusters.
+
+
+#There is a large cluster of people who are geographically close, and within that large cluster there are three smaller clusters of people who are even closer together. Most people in the large cluster have high scores.
+
+

Please render your code as an .html file using knitr and Pull Request both your .Rmd and .html files to the Assignment 4 repository.

+