From 48d3f20f0f48e5b794c5abe8e3a499376260d974 Mon Sep 17 00:00:00 2001 From: timothyleeXQ Date: Thu, 26 Sep 2019 18:53:21 -0400 Subject: [PATCH 1/4] Completed activity Finished the questions, may improve and update. --- class-activity-2.Rmd | 48 ++++++++++++++++++++++++++++---------------- 1 file changed, 31 insertions(+), 17 deletions(-) diff --git a/class-activity-2.Rmd b/class-activity-2.Rmd index e547dd9..6b61a94 100644 --- a/class-activity-2.Rmd +++ b/class-activity-2.Rmd @@ -15,14 +15,16 @@ D2 <- filter(D1, schoolyear == 20112012) #Histograms ```{r} -#Generate a histogramof the percentage of free/reduced lunch students (frl_percent) at each school +#Generate a histogram of the percentage of free/reduced lunch students (frl_percent) at each school -hist() +hist(D2$frl_percent) #Change the number of breaks to 100, do you get the same impression? hist(D2$frl_percent, breaks = 100) +#Yes. Both histograms show negative skew with most data around 80% of students with free/reduced lunch. + #Cut the y-axis off at 30 hist(D2$frl_percent, breaks = 100, ylim = c(0,30)) @@ -31,8 +33,6 @@ hist(D2$frl_percent, breaks = 100, ylim = c(0,30)) hist(D2$frl_percent, breaks = c(0,10,20,80,100)) - - ``` #Plots @@ -76,25 +76,33 @@ pairs(D5) 1. Create a simulated data set containing 100 students, each with a score from 1-100 representing performance in an educational game. The scores should tend to cluster around 75. Also, each student should be given a classification that reflects one of four interest groups: sport, music, nature, literature. 
```{r} -#rnorm(100, 75, 15) creates a random sample with a mean of 75 and standard deviation of 20 +#rnorm(100, 75, 15) creates a random sample with a mean of 75 and standard deviation of 15 #pmax sets a maximum value, pmin sets a minimum value #round rounds numbers to whole number values #sample draws a random samples from the groups vector according to a uniform distribution - - +studentPerformance = rnorm(100, 75, 15) +studentPerformance[studentPerformance > 100] = 100 +studentPerformance[studentPerformance < 1] = 1 +studentPerformance = round(studentPerformance) +studentInterest = sample(c("sport", "music", "nature", "literature"), size = 100, replace = TRUE) +studentData = data.frame(id = c(1:100), studentPerformance, studentInterest) ``` 2. Using base R commands, draw a histogram of the scores. Change the breaks in your histogram until you think they best represent your data. ```{r} - +hist(studentData$studentPerformance, breaks = 8) ``` - 3. Create a new variable that groups the scores according to the breaks in your histogram. ```{r} #cut() divides the range of scores into intervals and codes the values in scores according to which interval they fall. We use a vector called `letters` as the labels, `letters` is a vector made up of the letters of the alphabet. +letters = c("F", "E", "D", "C", "B", "A") +studentGrade = cut(studentData$studentPerformance, + breaks = c(40, 50, 60, 70, 80, 90, 100), + labels = letters) +studentData = cbind(studentData, studentGrade) ``` @@ -106,9 +114,10 @@ library(RColorBrewer) #The top section of palettes are sequential, the middle section are qualitative, and the lower section are diverging. 
#Make RColorBrewer palette available to R and assign to your bins +histColourPalette = brewer.pal(7, "OrRd") #Use named palette in histogram - +hist(studentData$studentPerformance, breaks = 8, col = histColourPalette) ``` @@ -116,20 +125,26 @@ library(RColorBrewer) ```{r} #Make a vector of the colors from RColorBrewer - +boxColourPalette = brewer.pal(4, "Spectral") +boxplot(studentData$studentPerformance ~ studentData$studentInterest, col = boxColourPalette) ``` - 6. Now simulate a new variable that describes the number of logins that students made to the educational game. They should vary from 1-25. ```{r} - +logins = sample(c(1:25), size = 100, replace = TRUE) +studentData = cbind(studentData, logins) ``` 7. Plot the relationships between logins and scores. Give the plot a title and color the dots according to interest group. ```{r} - +plot(studentData$studentPerformance, studentData$logins, + main = "Plot of Logins against scores", + xlab = "Student Scores", + ylab = "Student Logins", + col = studentData$studentInterest, + pch=19) ``` @@ -137,14 +152,13 @@ library(RColorBrewer) 8. R contains several inbuilt data sets, one of these in called AirPassengers. Plot a line graph of the the airline passengers over time using this data set. ```{r} - +plot(AirPassengers, type = "l") ``` - 9. Using another inbuilt data set, iris, plot the relationships between all of the variables in the data set. Which of these relationships is it appropraiet to run a correlation on? ```{r} - +pairs(iris) ``` 10. Finally use the knitr function to generate an html document from your work. If you have time, try to change some of the output using different commands from the RMarkdown cheat sheet. From d1f20740d6ae59f5bed19f7cbe8433343752af12 Mon Sep 17 00:00:00 2001 From: timothyleeXQ Date: Sun, 29 Sep 2019 14:20:28 -0400 Subject: [PATCH 2/4] Finished assignment. Committing changes. 
Committing completed Rmd file and its associated knitted html file --- class-activity-2.Rmd | 2 + class-activity-2.html | 570 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 572 insertions(+) create mode 100644 class-activity-2.html diff --git a/class-activity-2.Rmd b/class-activity-2.Rmd index 6b61a94..52dbe55 100644 --- a/class-activity-2.Rmd +++ b/class-activity-2.Rmd @@ -163,4 +163,6 @@ pairs(iris) 10. Finally use the knitr function to generate an html document from your work. If you have time, try to change some of the output using different commands from the RMarkdown cheat sheet. +*I presume this just means press "knit"?* + 11. Commit, Push and Pull Request your work back to the main branch of the repository diff --git a/class-activity-2.html b/class-activity-2.html new file mode 100644 index 0000000..b6c1f25 --- /dev/null +++ b/class-activity-2.html @@ -0,0 +1,570 @@ + + + + + + + + + + + + + + + + +intro to viz + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

#Input

+
D1 <- read.csv("School_Demographics_and_Accountability_Snapshot_2006-2012.csv", header = TRUE, sep = ",")
+
+#Create a data frame that only contains the years 2011-2012
+library(dplyr)
+
## 
+## Attaching package: 'dplyr'
+
## The following objects are masked from 'package:stats':
+## 
+##     filter, lag
+
## The following objects are masked from 'package:base':
+## 
+##     intersect, setdiff, setequal, union
+
D2 <- filter(D1, schoolyear == 20112012)
+

#Histograms

+
#Generate a histogram of the percentage of free/reduced lunch students (frl_percent) at each school
+
+hist(D2$frl_percent)
+

+
#Change the number of breaks to 100, do you get the same impression?
+
+hist(D2$frl_percent, breaks = 100)
+

+
#Yes. Both histograms show negative skew with most data around 80% of students with free/reduced lunch.
+
+#Cut the y-axis off at 30
+
+hist(D2$frl_percent, breaks = 100, ylim = c(0,30))
+

+
#Restore the y-axis and change the breaks so that they are 0-10, 10-20, 20-80, 80-100
+
+hist(D2$frl_percent, breaks = c(0,10,20,80,100))
+

+

#Plots

+
#Plot the number of English language learners (ell_num) by Computational Thinking Test scores (ctt_num) 
+
+plot(D2$ell_num, D2$ctt_num)
+

+
#Create two variables x & y
+x <- c(1,3,2,7,6,4,4)
+y <- c(2,4,2,3,2,4,3)
+
+#Create a table from x & y
+table1 <- table(x,y)
+
+#Display the table as a Barplot
+barplot(table1)
+

+
#Create a data frame of the average total enrollment for each year and plot the two against each other as a line
+
+library(tidyr)
+
## 
+## Attaching package: 'tidyr'
+
## The following object is masked _by_ '.GlobalEnv':
+## 
+##     table1
+
D3 <- D1 %>% group_by(schoolyear) %>% summarise(mean_enrollment = mean(total_enrollment))
+
+plot(D3$schoolyear, D3$mean_enrollment, type = "l", lty = "dashed")
+

+
#Create a boxplot of total enrollment for three schools
+D4 <- filter(D1, DBN == "31R075"|DBN == "01M015"| DBN == "01M345")
+#The droplevels command will remove all the schools from the variable with no data
+D4 <- droplevels(D4)
+boxplot(D4$total_enrollment ~ D4$DBN)
+

#Pairs

+
#Use matrix notation to select columns 5,6, 21, 22, 23, 24
+D5 <- D2[,c(5,6, 21:24)]
+#Draw a matrix of plots for every combination of variables
+pairs(D5)
+

# Exercise

+
    +
  1. Create a simulated data set containing 100 students, each with a score from 1-100 representing performance in an educational game. The scores should tend to cluster around 75. Also, each student should be given a classification that reflects one of four interest groups: sport, music, nature, literature.
  2. +
+
#rnorm(100, 75, 15) creates a random sample with a mean of 75 and standard deviation of 15
+#pmax(x, lower) enforces a minimum value, pmin(x, upper) enforces a maximum value
+#round rounds numbers to whole number values
+#sample draws a random sample from the groups vector according to a uniform distribution
+studentPerformance = rnorm(100, 75, 15)
+studentPerformance[studentPerformance > 100] = 100
+studentPerformance[studentPerformance < 1] = 1
+studentPerformance = round(studentPerformance)
+studentInterest = sample(c("sport", "music", "nature", "literature"), size = 100, replace = TRUE)
+studentData = data.frame(id = c(1:100), studentPerformance, studentInterest)
+
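The comments in the chunk above mention `pmax` and `pmin`, while the committed code clamps the scores by index assignment. As a minimal sketch of the same clamp written with those two functions (the seed and the names `rawScores` and `clampedScores` are assumptions for illustration, not part of the commit):

```r
# Hypothetical seed, chosen only so the sketch is reproducible
set.seed(42)

# Draw 100 scores centred on 75 with a standard deviation of 15
rawScores <- rnorm(100, mean = 75, sd = 15)

# pmax(x, 1) raises anything below 1 up to 1 (lower bound);
# pmin(x, 100) lowers anything above 100 down to 100 (upper bound)
clampedScores <- round(pmin(pmax(rawScores, 1), 100))

# Every score should now lie in the 1-100 range
range(clampedScores)
```

Both approaches should give the same result; the nested call simply avoids modifying the vector in several steps.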
    +
  1. Using base R commands, draw a histogram of the scores. Change the breaks in your histogram until you think they best represent your data.
  2. +
+
hist(studentData$studentPerformance, breaks = 8)
+

+
    +
  1. Create a new variable that groups the scores according to the breaks in your histogram.
  2. +
+
#cut() divides the range of scores into intervals and codes each score according to the interval it falls into. We use a vector called `letters` as the labels; `letters` is a vector made up of the letters of the alphabet.
+letters = c("F", "E", "D", "C", "B", "A")
+studentGrade = cut(studentData$studentPerformance,
+                  breaks = c(40, 50, 60, 70, 80, 90, 100),
+                  labels = letters)
+studentData = cbind(studentData, studentGrade)
+
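One thing to watch with `cut()`: values outside the supplied breaks come back as `NA`, and by default the intervals are right-closed, so a value equal to the lowest break is also dropped. With breaks starting at 40, a simulated score of 40 or lower would therefore get no grade. The sketch below shows one possible way around this, assuming the lowest bin is widened to start at 0 (an assumption made here, not the author's choice) and using the hypothetical name `gradeLabels` so the built-in `letters` is not masked:

```r
# Illustrative scores, including values at and below the original lowest break of 40
exampleScores <- c(12, 40, 41, 55, 70, 88, 100)

# Hypothetical label vector; a distinct name avoids masking the built-in `letters`
gradeLabels <- c("F", "E", "D", "C", "B", "A")

# Assumption: the lowest bin is widened to start at 0 so that no score is dropped;
# include.lowest = TRUE also keeps a score equal to the first break
exampleGrades <- cut(exampleScores,
                     breaks = c(0, 50, 60, 70, 80, 90, 100),
                     labels = gradeLabels,
                     include.lowest = TRUE)

# Cross-tabulate to check that every score received a grade
table(exampleGrades, useNA = "ifany")
```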
    +
  1. Now using the colorbrewer package (RColorBrewer; http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3) design a palette and assign it to the groups in your data on the histogram.
  2. +
+
library(RColorBrewer)
+#Let's look at the available palettes in RColorBrewer
+
+#The top section of palettes are sequential, the middle section are qualitative, and the lower section are diverging.
+#Make RColorBrewer palette available to R and assign to your bins
+histColourPalette = brewer.pal(7, "OrRd")
+
+#Use named palette in histogram
+hist(studentData$studentPerformance, breaks = 8, col = histColourPalette)
+

+
    +
  1. Create a boxplot that visualizes the scores for each interest group and color each interest group a different color.
  2. +
+
#Make a vector of the colors from RColorBrewer
+boxColourPalette = brewer.pal(4, "Spectral")
+boxplot(studentData$studentPerformance ~ studentData$studentInterest, col = boxColourPalette)
+

+
    +
  1. Now simulate a new variable that describes the number of logins that students made to the educational game. They should vary from 1-25.
  2. +
+
logins = sample(c(1:25), size = 100, replace = TRUE)
+studentData = cbind(studentData, logins)
+
    +
  1. Plot the relationships between logins and scores. Give the plot a title and color the dots according to interest group.
  2. +
+
plot(studentData$studentPerformance, studentData$logins,
+     main = "Plot of Logins against scores",
+     xlab = "Student Scores",
+     ylab = "Student Logins",
+     col = studentData$studentInterest,
+     pch=19)
+

+
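Passing `col = studentData$studentInterest` relies on that column being a factor, so that each group maps to an integer colour code. From R 4.0.0 onwards `data.frame()` keeps strings as character vectors by default, in which case the interest names would be handed to `col` as if they were colour names. A minimal sketch that makes the mapping explicit, assuming freshly simulated data and hypothetical names (`demoData`, `interestFactor`, `groupColours`) and an arbitrary choice of colours:

```r
# Hypothetical seed and data, regenerated here so the sketch is self-contained
set.seed(1)
demoData <- data.frame(
  score    = round(pmin(pmax(rnorm(100, 75, 15), 1), 100)),
  logins   = sample(1:25, size = 100, replace = TRUE),
  interest = sample(c("sport", "music", "nature", "literature"),
                    size = 100, replace = TRUE)
)

# Convert explicitly to a factor so each group maps to an integer code
interestFactor <- factor(demoData$interest)

# One base colour per factor level (the colour choice is arbitrary)
levelColours <- c("red", "blue", "darkgreen", "orange")
groupColours <- levelColours[as.integer(interestFactor)]

plot(demoData$score, demoData$logins,
     main = "Logins against scores",
     xlab = "Student Scores",
     ylab = "Student Logins",
     col = groupColours,
     pch = 19)

# The legend ties each colour back to its interest group
legend("topright", legend = levels(interestFactor),
       col = levelColours[seq_along(levels(interestFactor))], pch = 19)
```

Adding the legend is optional, but without it the colours cannot be traced back to the interest groups.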
    +
  1. R contains several inbuilt data sets, one of these is called AirPassengers. Plot a line graph of the airline passengers over time using this data set.
  2. +
+
plot(AirPassengers, type = "l")
+

+
    +
  1. Using another inbuilt data set, iris, plot the relationships between all of the variables in the data set. Which of these relationships is it appropriate to run a correlation on?
  2. +
+
pairs(iris)
+

+
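On the question of which relationships suit a correlation: Pearson's correlation is defined for numeric variables, and in `iris` the four measurements are numeric while `Species` is a factor. A short sketch of how one might restrict the correlation to the numeric columns (the names `numericIris` and `irisCor` are illustrative, not from the commit):

```r
# iris ships with base R (datasets package)
# Keep only the numeric columns; Species is a factor and is excluded
numericIris <- iris[, sapply(iris, is.numeric)]

# Pairwise Pearson correlations between the four measurements
irisCor <- cor(numericIris)
round(irisCor, 2)
```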
    +
  1. Finally, use the knitr function to generate an html document from your work. If you have time, try to change some of the output using different commands from the RMarkdown cheat sheet.
  2. +
+

I presume this just means press “knit”?

+
    +
  1. Commit, Push and Pull Request your work back to the main branch of the repository
  2. +
+ + + + +
+ + + + + + + + + + + + + + + From 0cfc33638b438b10fcf27cb0b6c9df55077e4527 Mon Sep 17 00:00:00 2001 From: Timothy Lee Date: Wed, 8 Apr 2020 16:35:04 +0800 Subject: [PATCH 3/4] add .gitattributes --- .gitattributes | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 .gitattributes diff --git a/.gitattributes b/.gitattributes new file mode 100644 index 0000000..32a4ddf --- /dev/null +++ b/.gitattributes @@ -0,0 +1,2 @@ +*.html linguist-detectable=false +*.Rmd linguist-language=R \ No newline at end of file From f0e8308d8844cbf1c45b23dc00796b26b8378a7f Mon Sep 17 00:00:00 2001 From: Timothy Lee Date: Wed, 8 Apr 2020 16:36:04 +0800 Subject: [PATCH 4/4] update readme with 4050 inst notes header, info --- README.md | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index a7fccc8..70c8433 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,23 @@ -# Class Activity 2 - Introduction to Vizualization +# Introduction to Vizualization +This repo contains files for an in class activity (class activity 2) on data +visualisation using base R plotting functions for HUDK 4050: Core Methods in +Educational Data Mining. + +HUDK 4050 is the first of three core courses in the Learning Analytics MS at +Teachers College, Columbia University focusing on the thinking, methods, and +conventions in data science. Particular attention is given to the fields of +Educational Data Mining and Learning Analytics. Refer to the +[Syllabus](https://github.com/timothyLeeXQ/HUDK-4050-Syllabus) (forked from +the [main repo](https://github.com/core-methods-in-edm/syllabus) which may +contain updates for future class iterations) for more information on HUDK 4050. + +Other classes in the series are: +* [HUDK 4051: Learning Analytics: + Process and Theory](https://github.com/timothyLeeXQ/HUDK-4051-Syllabus) ([Main + repo](https://github.com/la-process-and-theory/syllabus)) +* HUDK 5053: Feature Engineering Studio (Starting in May 2020. 
+ [Main repo](https://github.com/feature-engineering-studio/syllabus)) + +## Instructor Notes Introduction to Visualization using the Base R commands. Please fork and clone this repo and open the .Rmd file for further instructions.