Binary file added HW.png
76 changes: 23 additions & 53 deletions README.md
@@ -1,77 +1,47 @@
# SQL & Relational Databases
# Prediction

Relational databases are the backbone of data science and the language that we use to communicate with them is called SQL. The SQL test is a common component of data-adjacent jobs within industry, government and the education sector. It is a useful tool that some argue [spawned the field of data science](https://www.kdnuggets.com/gpspubs/sigkdd-explorations-kdd-10-years.html). Before Big Data was a thing, Knowledge Discovery in Databases (KDD) used simple SQL queries to investigate and understand the nature of the large amounts of data that were being collected by governments and companies. The humble SQL test now torments the budding data scientist as a rite of passage in the job search process.

In this unit you will learn the basic ideas behind relational databases and SQL. You will set up a database in Amazon Web Services and then connect to it through RStudio. You will then load data into your database and run SQL queries on that data.
Prediction of student behavior has been a prominent area of research in learning analytics and a major concern for higher education institutions and ed tech companies alike. It is the bedrock of [methodology within the world of cognitive tutors](https://solaresearch.org/hla-17/hla17-chapter5/) and these methods have been exported to other areas within the education technology landscape. The ability to predict what a student is likely to do in the future so that interventions can be tailored to them has seen major growth and investment, [though implementation is non-trivial and expensive](https://www.newamerica.org/education-policy/policy-papers/promise-and-peril-predictive-analytics-higher-education/). Although some institutions, such as [Purdue University](https://www.itap.purdue.edu/learning/tools/forecast.html), have seen success, we have yet to see widespread adoption of these approaches as they tend to be highly institution-specific and require very concrete outcomes to be useful.

## Goals for this Unit

* Be able to discuss an overview of relational databases and the purpose of SQL
* Be able to spin up an AWS instance and load a SQL database into it
* Be able to connect to the database through RStudio/R using the DBI package
* Be able to run basic SQL commands in RStudio using the RMySQL package

## Resources

### Video 1
[![Introduction to Relational Databases & SQL](https://img.youtube.com/vi/G-rXRbdE7ow/0.jpg)](https://youtu.be/G-rXRbdE7ow)

[Transcript](https://github.com/la-process-and-theory/sql-db-setup/blob/master/hudk4051-sql-intro.rtf)
* Be able to discuss different uses for prediction algorithms in education
* Be able to discuss the theory behind the CART, Conditional Inference Trees and C5.0 classification algorithms
* Construct classification models to predict student dropout and state validation metrics for the model
* Compare classification models on appropriate metrics

[Video Slide Deck](https://github.com/la-process-and-theory/sql-db-setup/blob/master/HUDK4051-SQL.pdf)
## Tasks for this Unit

### Video 2
[![AWS Setup Instructions](https://img.youtube.com/vi/JnADtoprFMM/0.jpg)](https://youtu.be/JnADtoprFMM)
In this unit you will be working towards building models to predict student course dropout and then comparing those models. As background to this task please read over the following materials and watch the methodological videos. If you find any other useful materials please add them under **Additional Materials** at the end of this page and pull request the change back to this repo.

[SQL Cheat Sheet](https://mariadb.com/kb/en/basic-sql-statements/)

[Overview of Amazon Web Services](https://docs.aws.amazon.com/whitepapers/latest/aws-overview/introduction.html)

[AWS](https://aws.amazon.com/)
[AWS China](https://www.amazonaws.cn/?nc1=f_ls)
## Resources

## Project: AWS Database and SQL
### Videos

Please *fork* and *clone* this repository to your computer. If you are unfamiliar with this process you must sign up for office hours.
[![Introduction to Prediction](https://img.youtube.com/vi/BqQR9n-Bolw/0.jpg)](https://youtu.be/BqQR9n-Bolw)

Then you will need to create an account with AWS. This will require a credit card, although we will only be using free services. If you already have an Amazon account you can use that account.
[Video Slide Deck](https://github.com/la-process-and-theory/prediction/blob/master/prediction-slides.pdf)

Please create the account through the regional website for your location.
[Jalayer Academy. (2015). R - Classification Trees (part 1 using C5.0)](https://www.youtube.com/watch?v=5NquIfQxpxk)

Once you have created an account, follow the directions below; these steps are also shown in the video above.
[Grey, C.G.P. (2017). How Machines Learn.](https://www.youtube.com/watch?v=R9OHn5ZF4Uo)

## Step by Step to Create MySQL Instance on Amazon Web Services
### Readings

* Log into your [AWS Management Console](https://console.aws.amazon.com)
* Locate `RDS` under the `Databases` heading
* Within Amazon RDS click `Create database`
* Under `Choose a database creation method` click `Standard Create`
* Under `Engine options` choose `MySQL`
* Under `Templates` choose `Free tier`
* Under `Settings` name your `DB instance identifier` as `database-1`
* Under `Credential settings` create a username and password combination and write it down (you will need it later)
* Under `Connectivity` expand `Additional connectivity configuration` to show additional menu items and make sure that `Publicly accessible` is checked `Yes`
* Expand the `Additional configuration` menu
* Under `Initial database name` write `oudb`
* Uncheck `Automatic backups`
* Click `Create database`
* Once the database is created, take a screenshot and add it to your repository
[Adhatrao, K., Gaykar, A., Dhawan, A., Jha, R., & Honrao, V. (2013). Predicting Students’ Performance Using ID3 and C4. 5 Classification Algorithms. International Journal of Data Mining & Knowledge Management Process, 3(5).](https://arxiv.org/ftp/arxiv/papers/1310/1310.2071.pdf)

## Modify Security Group
[Brooks, C. & Thompson, C. (2017). Predictive modelling in teaching and learning. In The Handbook of Learning Analytics. SOLAR: Vancouver, BC](https://solaresearch.org/hla-17/hla17-chapter5/)

* Under `Security Groups` click `Inbound` and then `Edit`
* Add the rule `SQL/Aurora` on `Port 3306` with the `Connection` of `MyIP`
## Knowledge Check

## RStudio
Once you have completed all tasks in the unit, please complete the [knowledge check](https://tccolumbia.qualtrics.com/jfe/form/SV_eJ0QJWNsklHsdro).

* Open the sql-project.Rmd file in RStudio and follow the directions.
## Additional Materials

## Submission
[The caret package](https://topepo.github.io/caret/train-models-by-tag.html)

* Once you have completed the project please commit and pull request the repository back to the main branch. Please be sure to include your Rmd file and your screenshot of the AWS console page. The due date for this project is **January 27 by 5:00pm EDT**. Don't forget to delete your AWS database so you don't get charged any money!
[The C50 package](https://topepo.github.io/C5.0/)

## Knowledge Check
[Pradhan, C. (2016). What are the differences between ID3, C4.5 and CART? Quora](https://www.quora.com/What-are-the-differences-between-ID3-C4-5-and-CART)

[After you submit, please complete the knowledge check quiz located here](https://tccolumbia.qualtrics.com/jfe/form/SV_2i3mluBkpyjW0Um)


51 changes: 51 additions & 0 deletions carInfo.csv
@@ -0,0 +1,51 @@
id,car_name,price,made_date
1,car_1,5200,2020/1/23
2,car_2,9900,2020/1/23
3,car_3,14300,2020/1/23
4,car_4,18500,2020/1/23
5,car_5,22500,2020/1/23
6,car_6,26300,2020/1/23
7,car_7,29900,2020/1/23
8,car_8,33300,2020/1/23
9,car_9,36500,2020/1/23
10,car_10,39500,2020/1/23
11,car_11,42300,2020/3/29
12,car_12,44900,2020/3/29
13,car_13,47300,2020/3/29
14,car_14,49500,2020/3/29
15,car_15,51500,2020/3/29
16,car_16,53300,2020/3/29
17,car_17,54900,2020/3/29
18,car_18,56300,2020/3/29
19,car_19,57500,2020/3/29
20,car_20,58500,2020/3/29
21,car_21,59300,2020/7/1
22,car_22,59900,2020/7/1
23,car_23,60300,2020/7/1
24,car_24,60500,2020/7/1
25,car_25,60500,2020/7/1
26,car_26,60300,2020/7/1
27,car_27,59900,2020/7/1
28,car_28,59300,2020/7/1
29,car_29,58500,2020/7/1
30,car_30,57500,2020/7/1
31,car_31,56300,2020/9/15
32,car_32,54900,2020/9/15
33,car_33,53300,2020/9/15
34,car_34,51500,2020/9/15
35,car_35,49500,2020/9/15
36,car_36,47300,2020/9/15
37,car_37,44900,2020/9/15
38,car_38,42300,2020/9/15
39,car_39,39500,2020/9/15
40,car_40,36500,2020/9/15
41,car_41,33300,2020/12/31
42,car_42,29900,2020/12/31
43,car_43,26300,2020/12/31
44,car_44,22500,2020/12/31
45,car_45,18500,2020/12/31
46,car_46,14300,2020/12/31
47,car_47,9900,2020/12/31
48,car_48,5300,2020/12/31
49,car_49,5500,2020/12/31
50,car_50,4500,2020/12/31
179 changes: 179 additions & 0 deletions predictionKarl.Rmd
@@ -0,0 +1,179 @@
---
title: "HUDK4051: Prediction - Comparing Trees"
author: "Charles Lang"
date: "1/9/2018"
output: html_document
---

In this assignment you will be modeling student data using three flavors of tree algorithm: CART, C4.5 and C5.0. We will be using these algorithms to attempt to predict which students drop out of courses. Many universities have a problem with students over-enrolling in courses at the beginning of the semester and then dropping most of them as they make decisions about which classes to attend. This makes it difficult to plan for the semester and allocate resources. However, schools don't want to restrict the choice of their students. One solution is to create predictions of which students are likely to drop out of which courses and use these predictions to inform semester planning.

In this assignment we will be using the tree algorithms to build models of which students are likely to drop out of which classes.

## Software

In order to generate our models we will need several packages. The first package you should install is [caret](https://cran.r-project.org/web/packages/caret/index.html).

There are many prediction packages available and they all have slightly different syntax. caret is a package that brings all the different algorithms together under one roof, using the same syntax.

We will also be accessing an algorithm from the [Weka suite](https://www.cs.waikato.ac.nz/~ml/weka/). Weka is a collection of machine learning algorithms that have been implemented in Java and made freely available by the University of Waikato in New Zealand. To access these algorithms you will need to first install both the [Java Runtime Environment (JRE) and Java Development Kit](http://www.oracle.com/technetwork/java/javase/downloads/jre9-downloads-3848532.html) on your machine. You can then install the [RWeka](https://cran.r-project.org/web/packages/RWeka/index.html) package within R.

**Weka requires Java, and Java causes problems. If you cannot install Java and make Weka work, please follow the alternative instructions at line 121.**
(Issue 1: if RWeka/RWekajars fails to install, paste `sudo R CMD javareconf` into the terminal and try to install again.)

The last package you will need is [C50](https://cran.r-project.org/web/packages/C50/index.html).
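
If you want to grab everything in one go, here is a minimal installation sketch (the package list is assumed from what is loaded later in this file; run once):

```{r, eval=FALSE}
## Install the packages used in this assignment (run once).
## RWeka additionally requires the Java JRE/JDK described above.
install.packages(c("caret", "rpart", "C50", "party", "RWeka", "MLmetrics", "plyr"))
```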

## Data

The data comes from a university registrar's office. The code book for the variables is available in the file code-book.txt. Examine the variables and their definitions.

Upload the drop-out.csv data into R as a data frame.

```{r}
library(plyr)
library(C50)
library(party)
library(caret)
library(RWeka)
library(MLmetrics)
## Data Import
student <- read.csv('/Users/prediction-master/drop-out.csv')
summary(student) ## No missing values

```

The next step is to separate your data set into a training set and a test set. Randomly select 25% of the students to be the test data set and leave the remaining 75% for your training data set. (Hint: each row represents an answer, not a single student.)

```{r}
## Data partition: 75% of rows for training, 25% for testing
trainData <- createDataPartition(y = student$complete, p = 0.75, list = FALSE)
training <- student[trainData, ]
testing  <- student[-trainData, ]
## Drop the first (id) column before modelling
training <- training[, -1]
testing  <- testing[, -1]

```
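
Given the hint that each row is one answer rather than one student, an alternative sketch (assuming the first column of `student` identifies the student) splits at the student level instead of the row level, so that no student's answers end up in both sets:

```{r}
## Student-level split: sample 75% of the unique ids in the first column for training
ids <- unique(student[[1]])
trainIds <- sample(ids, size = round(0.75 * length(ids)))
training2 <- student[student[[1]] %in% trainIds, -1]
testing2  <- student[!(student[[1]] %in% trainIds), -1]
```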

For this assignment you will be predicting the student level variable "complete".
(Hint: make sure you understand the increments of each of your chosen variables; this will impact your tree construction.)

Visualize the relationships between your chosen variables as a scatterplot matrix. Save your image as a .pdf named scatterplot_matrix.pdf. Based on this visualization do you see any patterns of interest? Why or why not?

```{r}
## Scatterplot matrix (drop the id column)
student_noid <- student[, -1]
pairs(~., data = student_noid)
## Save the same plot to a PDF, as the assignment asks
pdf(file = "scatterplot_matrix.pdf")
pairs(~., data = student_noid)
dev.off()
## Since the response variable is non-numeric, the scatterplot matrix doesn't
## provide much meaningful information for variable selection.
```

## CART Trees

You will use the [rpart package](https://cran.r-project.org/web/packages/rpart/rpart.pdf) to generate CART tree models.

Construct a classification tree that predicts complete using the caret package.

```{r}
library(caret)
## Custom summary function: combine Accuracy/Kappa, ROC-based, and
## precision/recall metrics in a single results table
MySummary <- function(data, lev = NULL, model = NULL) {
  df <- defaultSummary(data, lev, model)
  tc <- twoClassSummary(data, lev, model)
  pr <- prSummary(data, lev, model)
  c(df, tc, pr)
}

## Repeated k-fold cross-validation
ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = MySummary)

## Model training: CART via rpart
fit1 <- train(complete ~ ., data = training,
              method = "rpart",
              preProc = c("center", "scale"),
              trControl = ctrl,
              metric = "Accuracy")
fit1$bestTune ## the best tuning in this case is cp = 0.01005
fit1 ## a good fit: overall accuracy is about 0.89, with high specificity and acceptable sensitivity

sens <- 0.6553508
spec <- 0.9951672
## Harmonic mean of sensitivity and specificity at cp = 0.01005
## (note: the standard F1 score is defined on precision and recall)
2 * (sens * spec) / (sens + spec)

```

Describe important model attributes of your tree. Do you believe it is a successful model of student performance, why/why not?
It is a good model fit: the overall accuracy is about 0.89, with a high specificity and an acceptable sensitivity.
Can you use the sensitivity and specificity metrics to calculate the F1 metric?
The harmonic mean `2*(sens*spec)/(sens+spec)` is about 0.79 at cp = 0.01005, but the standard F1 score is defined on precision and recall rather than sensitivity and specificity.
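
Since `MySummary` above already includes `prSummary`, the cross-validated precision, recall, and F score are available in `fit1$results`; here is a small sketch (column names assumed to follow caret's `prSummary` convention) of pulling them out and recomputing F1 by hand:

```{r}
## F1 is built from precision and recall, not from sensitivity and specificity.
## prSummary (included in MySummary) reports them as Precision, Recall, and F.
fit1$results[, c("cp", "Precision", "Recall", "F")]

best <- fit1$results[fit1$results$cp == fit1$bestTune$cp, ]
2 * (best$Precision * best$Recall) / (best$Precision + best$Recall)  # should match best$F
```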

Now predict results from the test data and describe important attributes of this test. Do you believe it is a successful model of student performance, why/why not?
It is a good model, since the accuracy on the test set is about 0.91.
```{r}
## predict for the test dataset
cartClasses <- predict(fit1,newdata = testing) ## prediction result
confusionMatrix(data = cartClasses, as.factor(testing$complete)) #confusion Matrix


```

## Conditional Inference Trees

Train a Conditional Inference Tree using the `party` package on the same training data and examine your results.
```{r}
## Conditional inference forest (method "cforest" from the party package)
conFit <- train(complete ~ ., data = training,
                method = "cforest",
                preProc = c("center", "scale"),
                trControl = ctrl,
                metric = "Accuracy")
conFit
conFit$finalModel
```
Describe important model attributes of your tree. Do you believe it is a successful model of student performance, why/why not?
The most important variable for deciding whether a student will drop out is `years`.
What does the plot represent? What information does this plot tell us?
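
The `cforest` fit above is an ensemble, so it has no single tree to plot; here is a minimal sketch (assuming the `party` package loaded earlier and the same `training` data) of fitting and plotting one conditional inference tree so the splits can be inspected:

```{r}
## party::ctree expects factors, so convert any character columns first
ctrain <- as.data.frame(lapply(training, function(x) if (is.character(x)) as.factor(x) else x))
singleTree <- ctree(complete ~ ., data = ctrain)
plot(singleTree)
```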

Now test your new Conditional Inference model by predicting the test data and generating model fit statistics.
```{r}
## Predict on the test set with the conditional inference model
conClasses <- predict(conFit, newdata = testing)
confusionMatrix(data = conClasses, as.factor(testing$complete)) # accuracy is about 0.90
```

There is an updated version of the C4.5 model called C5.0, implemented in the C50 package. What improvements have been made in the newer version?

Install the C50 package, train and then test the C5.0 model on the same data.

```{r}
## C5.0 (cost-sensitive variant, method "C5.0Cost" in caret)
c50Fit <- train(complete ~ ., data = training,
                method = "C5.0Cost",
                preProc = c("center", "scale"),
                trControl = ctrl,
                metric = "Accuracy")
c50Fit
```

## Compare the models

caret allows us to compare all three models at once.

```{r}
## Collect and summarize resampling results from the three models
resamps <- resamples(list(cart = fit1, condinf = conFit, cfiveo = c50Fit))
summary(resamps)

conFit$finalModel
c50Fit$finalModel
## Across the three final models, the most important variable for predicting
## whether a student will drop out is years.

```
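
As a supplement, here is a short sketch (using the `resamps` object above; the `bwplot`, `dotplot`, and `diff` methods for `resamples` objects ship with caret) for visual and paired comparisons:

```{r}
## Visual comparison of the resampling distributions for the three models
bwplot(resamps)                        # box-and-whisker plot for each metric
dotplot(resamps, metric = "Accuracy")  # accuracy estimates with confidence intervals
## Paired differences between models, metric by metric
summary(diff(resamps))
```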

What does the model summary tell us? Which model do you believe is the best?
It is hard to say which model is best: the overall accuracies of the three models are all close to 0.90, their sensitivities are relatively low (about 0.65), and their specificities are very high (about 0.99).
Which variables (features) within your chosen model are important? Do these features provide insights that may be useful in solving the problem of students dropping out of courses?
`years` and `course_id`. Yes, these two features are the most important in separating the dataset.
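
A short sketch (assuming the fitted `fit1` and `c50Fit` objects above) of checking this claim with caret's `varImp()`:

```{r}
## Model-specific variable importance for the CART and C5.0 fits
varImp(fit1)
varImp(c50Fit)
plot(varImp(fit1))  # importance plot for the CART model
```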