Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
132 changes: 97 additions & 35 deletions Assignment 4.Rmd → Cluster-Analysis.Rmd
Original file line number Diff line number Diff line change
@@ -1,21 +1,27 @@
---
title: "Assignment 4: K Means Clustering"
title: "K Means Clustering"
author: "Nicole Schlosberg"
date: "11/4/2020"
output: html_document
---

In this assignment we will be applying the K-means clustering algorithm we looked at in class. At the following link you can find a description of K-means:
We will be applying the K-means clustering algorithm we looked at in class. At the following link you can find a description of K-means:

https://www.cs.uic.edu/~wilkinson/Applets/cluster.html


```{r}
library()
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
```

Now, upload the file "Class_Motivation.csv" from the Assignment 4 Repository as a data frame called "K1""
```{r}

K1 <- read.csv(...)
## Part I

```{r}
K1 <- read.csv("Class_Motivation.csv", header = TRUE)
```

This file contains the self-reported motivation scores for a class over five weeks. We are going to look for patterns in motivation over this time and sort people into clusters based on those patterns.
Expand All @@ -25,33 +31,23 @@ But before we do that, we will need to manipulate the data frame into a structur
The algorithm will treat each row as a value belonging to a person, so we need to remove the id variable.

```{r}

K2 <-

K2 <- select(K1,2:6)
```

It is important to think about the meaning of missing values when clustering. We could treat them as having meaning or we could remove those people who have them. Neither option is ideal. What problems do you foresee if we recode or remove these values? Write your answers below:



We will remove people with missing values for this assignment, but keep in mind the issues that you have identified.


```{r}

K3 <- na.omit(K2) #This command create a data frame with only those people with no missing values. It "omits" all rows with missing values, also known as a "listwise deletion". EG - It runs down the list deleting rows as it goes.

```

Another pre-processing step used in K-means is to standardize the values so that they have the same range. We do this because we want to treat each week as equally important - if we do not standardise then the week with the largest range will have the greatest impact on which clusters are formed. We standardise the values by using the "scale()" command.

```{r}

K3 <-

K3 <- scale(K3)
```


Now we will run the K-means clustering algorithm we talked about in class.
1) The algorithm starts by randomly choosing some starting values
2) Associates all observations near to those values with them
Expand All @@ -65,38 +61,54 @@ Notice that in this case we have 5 variables and in class we only had 2. It is i
Also, we need to choose the number of clusters we think are in the data. We will start with 2.

```{r}
#With 2 centers or clusters
fit <- kmeans(K3, 2)

fit <-
#With 3 centers or clusters
RedoFit <- kmeans(K3, 3)

#We have created an object called "fit" that contains all the details of our clustering including which observations belong to each cluster.

#We can access the list of clusters by typing "fit$cluster", the top row corresponds to the original order the rows were in. Notice we have deleted some rows.


fit$cluster

#We can also attach these clusters to the original dataframe by using the "data.frame" command to create a new data frame called K4.

K4
#With 2 centers or clusters
K4 <- data.frame(K3, fit$cluster)

#With 3 centers or clusters
RedoFit2 <- data.frame(K3, RedoFit$cluster)

#Have a look at the K4 dataframe. Lets change the names of the variables to make it more convenient with the names() command.

#With 2 centers or clusters
names(K4) <- c("1","2","3","4","5","cluster")

#With 3 centers or clusters
names(RedoFit2) <- c("1","2","3","4","5","cluster")
```

Now we need to visualize the clusters we have created. To do so we want to play with the structure of our data. What would be most useful would be if we could visualize average motivation by cluster, by week. To do this we will need to convert our data from wide to long format. Remember your old friends tidyr and dplyr!

First lets use tidyr to convert from wide to long format.
```{r}

#With 2 centers or clusters
K5 <- gather(K4, "week", "motivation", 1:5)

#With 3 centers or clusters
RedoFit3 <- gather(RedoFit2, "week", "motivation", 1:5)
```

Now lets use dplyr to average our motivation values by week and by cluster.

```{r}
#With 2 centers or clusters
K6 <- K5 %>% group_by(week, cluster) %>% summarise(avg = mean(motivation)) #K6,

K6 <- K5 %>% group_by(week, cluster) %>% summarise(K6, avg = mean(motivation))

#With 3 centers or clusters
RedoFit4 <- RedoFit3 %>% group_by(week, cluster) %>% summarise(avg = mean(motivation))
```

Now it's time to do some visualization:
Expand All @@ -112,11 +124,13 @@ We are going to create a line plot similar to the one created in this paper abou
Likewise, since "cluster" is not numeric but rather a categorical label we want to convert it from an "integer" format to a "factor" format so that ggplot does not treat it as a number. We can do this with the as.factor() command.

```{r}
#With 2 centers or clusters
K6$week <- as.numeric(K6$week)
K6$cluster <- as.factor(K6$cluster)

K6$week <-

K6$cluster <-

#With 3 centers or clusters
RedoFit4$week <- as.numeric(RedoFit4$week)
RedoFit4$cluster <- as.factor(RedoFit4$cluster)
```

Now we can plot our line plot using the ggplot command, "ggplot()".
Expand All @@ -127,35 +141,83 @@ Now we can plot our line plot using the ggplot command, "ggplot()".
- Finally we are going to clean up our axes labels: xlab("Week") & ylab("Average Motivation")

```{r}
#With 2 centers or clusters
ggplot(K6, aes(week, avg, color = cluster)) + geom_line() + xlab("Week") + ylab("Average Motivation")

ggplot(K6, aes(week, avg, colour = cluster)) + geom_line() + xlab("Week") + ylab("Average Motivation")

#With 3 centers or clusters
ggplot(RedoFit4, aes(week, avg, color = cluster)) + geom_line() + xlab("Week") + ylab("Average Motivation")
```

What patterns do you see in the plot?

The two clusters are doing the opposite of each other. As cluster 1 increases, cluster 2 decreases and vice versa.


It would be useful to determine how many people are in each cluster. We can do this easily with dplyr.

```{r}
#With 2 centers or clusters
K7 <- count(K4, cluster)

#With 3 centers or clusters
RedoFit5 <- count(RedoFit2, cluster)
```

Look at the number of people in each cluster, now repeat this process for 3 rather than 2 clusters. Which cluster grouping do you think is more informative? Write your answer below:

##Part II
The 3 clustered and 2 clustered numbers add up to the same number, which makes sense. It also appears that when you use 3 clusters here, the 2nd and 3rd ones might be composed of the same number (just 1 extra) as the 2nd one in the 2 cluster version. I think that the 3 clusters might be most informative as splitting out that last part shows something different is happening among the groupings.

Using the data collected in the HUDK4050 entrance survey (HUDK4050-cluster.csv) use K-means to cluster the students first according location (lat/long) and then according to their answers to the questions, each student should belong to two clusters.

##Part III
## Part II

Create a visualization that shows the overlap between the two clusters each student belongs to in Part II. IE - Are there geographical patterns that correspond to the answers?
Using the data collected in the HUDK4050 entrance survey (HUDK4050-cluster.csv) use K-means to cluster the students first according location (lat/long) and then according to their answers to the questions, each student should belong to two clusters.

```{r}
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)

#first use K-means to cluster the students according to location (lat/long)
S1 <- read.csv("HUDK405020-cluster.csv", header = TRUE)
S2 <- select(S1, 2:3)

Sfit1 <- kmeans(S2, 2)
Sfit2 <- kmeans(S2, 1)

test1 <- data.frame(Sfit1$cluster,Sfit2$cluster)

#Then use K-means to cluster the students according to their answers to the questions
D1 <- read.csv("HUDK405020-cluster.csv", header = TRUE)
D2 <- select(D1,4:9)
D3 <- scale(D2)

Dfit1 <- kmeans(D3, 1)
Dfit2 <- kmeans(D3, 2)
Dfit3 <- kmeans(D3, 3)
Dfit4 <- kmeans(D3, 4)
Dfit5 <- kmeans(D3, 5)
Dfit6 <- kmeans(D3, 6)
Dfit7 <- kmeans(D3, 7)

test2 <- data.frame(Dfit1$cluster,Dfit2$cluster,Dfit3$cluster,Dfit4$cluster,Dfit5$cluster,Dfit6$cluster,Dfit7$cluster)

#Then combine the two data.frames
DS <- data.frame(S1$lat,S1$long,Sfit1$cluster,D1$compare.features, D1$math.accuracy,D1$planner.use,D1$enjoy.discuss,D1$enjoy.group,D1$meet.deadline, Dfit3$cluster)

names(DS) <- c("lat","long","location_cluster","compare_features","math_accuracy","planner_use","enjoy_discuss","enjoy_group","meet_deadline","questions_cluster")
```


## Please render your code as an .html file using knitr and Pull Resquest both your .Rmd file and .html files to the Assignment 3 repository.
## Part III

Create a visualization that shows the overlap between the two clusters each student belongs to in Part II. IE - Are there geographical patterns that correspond to the answers?

It shows that there are some overlaps between the clusters. The higher the num the greater the overlap for those clusters (longitute and planner question)

```{r}
DS2 <- DS %>% group_by(questions_cluster,location_cluster) %>% summarize(num = n(), .groups = "keep")
ggplot(DS2, aes(x = questions_cluster, y = location_cluster, size = num)) + geom_point(pch=DS2$location_cluster,color=DS2$questions_cluster) + ggtitle("Clusters") + xlim(0.00, 3.50) + ylim(0.00,2.50)
```


Loading