@clayton-halim (Member) commented Aug 18, 2017

R

  • Created guide:
    • import
    • imputation
    • a little bit of plotting.

Python

  • Removed the large output in the python guide.
  • Used pandas.DataFrame.info() to determine the number of missing values in each column.
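
As a rough illustration of the `info()` approach mentioned above, here is a minimal sketch. The toy DataFrame and its values are made up for the example and are not taken from the guide; `isna().sum()` is shown alongside as an alternative that reports missing counts directly rather than non-null counts.

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the Titanic training data.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# info() reports the non-null count per column; missing = rows - non-null.
# Writing to a buffer lets us keep the text instead of only printing it.
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
print(info_text)

# isna().sum() gives the number of missing values per column directly.
missing_counts = df.isna().sum()
print(missing_counts)
```

`info()` is compact enough to replace a large printed output, which fits the goal of trimming the notebook.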

@clayton-halim changed the title from "Started R guide" to "Python & R Guide Update" on Aug 18, 2017
@jxnl self-requested a review on August 19, 2017 04:09
@jxnl (Collaborator) left a comment

Awesome communication with all the text; my comments are mostly just style changes. Also, it's actually easier to output it as markdown/HTML.

@@ -0,0 +1,545 @@
{

This file is in `.ipynb_checkpoints`, which you should add to the `.gitignore`.

@@ -0,0 +1,96 @@
---
title: "R Kaggle Guide (Titanic)"
author: "UWaterloo Data Science Club"

You're welcome to use your name here :) You should take credit for the tutorial.

This guide will look at the Titanic dataset; we will see if we can predict what types of people would have survived on the Titanic.

First we will import some useful libraries. R is an old language that has accumulated inconsistencies over time; the tidyverse is a set of libraries that makes common functions more consistent and powerful.
```{R}

You can add chunk options like the ones below to suppress warning messages.

{R include=FALSE, warning=FALSE}

The `$` lets us select specific variables in a dataframe.

```{R}
titanic_data$Survived <- as.factor(titanic_data$Survived)

Is there a reason to use this notation instead of dplyr's `mutate()`?

titanic_data %>%
   mutate(Survived=as.factor(Survived),
          Pclass=as.factor(Pclass), 
   ...)

We can observe the first `n` entries of our dataframe by using the `head()` function; likewise, to observe the last `n` entries we can use `tail()`. If there are too many variables, the output will omit some of them to save space.

```{R}
head(titanic_data, 5)

I'd love for you to introduce the `%>%` operator, just because it's the preferred way of doing things.

Perhaps explain what it does, and show that you can do both:

head(df, 5) and df %>% head(5)


## INCOMPLETE SECTION

Another method of imputation is prediction. It would be naive to use simple methods such as the mean, because we have other data that hint at a passenger's age. We can build a model to estimate the age from the other information we have.

Please include a section on missing data mechanisms.

More information can be found in the missing data section of Elements of Statistical Learning.

Basically, missingness in itself can be predictive, and we can always include `is.na(feature)` as a new indicator variable.
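
For the Python guide, a pandas analogue of the `is.na(feature)` indicator might look like the sketch below. The toy data, the `Age_missing` column name, and the choice of median imputation are all assumptions made for illustration; the key point is only that the indicator is recorded before the gaps are filled.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None, 35.0],
    "Survived": [0, 1, 1, 0, 1],
})

# Record missingness first, so the information survives imputation.
# "Age_missing" is an illustrative name, not one used in the guide.
df["Age_missing"] = df["Age"].isna().astype(int)

# Simple median imputation, as a placeholder for whichever method is chosen.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df)
```

A model can then use `Age_missing` as an ordinary feature next to the imputed `Age`.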
