From b6c2e317fbc0948025e98231dc78a3490daa1823 Mon Sep 17 00:00:00 2001
From: Mikkel
Date: Tue, 15 Feb 2022 10:04:45 +0100
Subject: [PATCH] deleted week3 in root

---
 Week3.ipynb | 172 ----------------------------------------------------
 1 file changed, 172 deletions(-)
 delete mode 100644 Week3.ipynb

diff --git a/Week3.ipynb b/Week3.ipynb
deleted file mode 100644
index 18f9062..0000000
--- a/Week3.ipynb
+++ /dev/null
@@ -1,172 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Week 3\n",
- "\n",
- "I hope you're getting the hang of things. Today we're going on with the prinicples of data visualization!"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Overview\n",
- "\n",
- "Once again, the lecture has three parts:\n",
- "\n",
- "* First you will watch a video on visualization and solve a couple of exercises.\n",
- "* After that, we'll be reading about *scientific data visualization*, and the huge number of things you can do with just one variable. Naturally, we'll be answering questions about that book. \n",
- "* And finally reproducing some of the plots from that book."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Part 1: Fundamentals of data visualization"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Last week we had a small introduction of data visualization. Today, we are going to be a bit more specific on data analysis and visualization. Digging a bit more into the theory with the next video.\n",
- "\n",
- "*It's important to highlight that these lectures are quite important. We don't have a formal book on data visualization. So the only source of knowledge about the **principles**, **theories**, and **ideas**, that are the foundation for good data viz, comes from the videos*. 
So watch them 🤓 \n",
- "\n",
- "[![IMAGE ALT TEXT HERE](https://img.youtube.com/vi/yiU56codNlI/0.jpg)](https://www.youtube.com/watch?v=yiU56codNlI)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> *Excercise 1.1:* Questions for the lecture\n",
- "> * As mentioned earlier, visualization is not the only way to test for correlation. We can (for example) calculate the Pearson correlation. Explain in your own words how the Pearson correlation works and write down it's mathematical formulation. Can you think of an example where it fails (and visualization works)?\n",
- "> * What is the difference between a bar-chart and a histogram?\n",
- "> * I mention in the video that it's important to choose the right bin-size in histograms. But how do you do that? Do a Google search to find a criterion you like and explain it. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Ok, now that we've talked a bit about correlation and distributions, we are going to compute/visualize them while also testing some hypotheses along the way. Until now, we have analysed data at an explorative level, but we can use statistics to verify whether relationships between variables are significant. We'll do this in the following exercise.\n",
- "\n",
- "> *Exercise 1.2:* Hypothesis testing. We will look into correlations between number of steps and BMI, and differences between two data samples (Females vs Males). Follow the steps below for success:\n",
- ">\n",
- "> * First, we need to get some data. Download and read the data from the Female group [here](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/data9b_f.csv) and the one from the Male group [here](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/data9b_m.csv).\n",
- "> * Next, we are going to verify the following hypotheses:\n",
- "> 1. *H1: there is a statistically significant difference in the average number of steps taken by men and women*. 
Is there a statistically significant difference between the two groups? What is the difference between their mean number of steps? Plot two histograms to visualize the step-count distributions, and use the criterion you chose in Ex.1.1 to define the right bin-size. \n",
- " **Hint** you can use the function `ttest_ind()` from the `stats` package to test the hypothesis and consider a significance level $\\alpha=0.05$.\n",
- "> 2. *H2: there is a negative correlation between the number of steps and the BMI for women*. We will use Pearson's correlation here. Is there a negative correlation? How big is it?\n",
- "> 3. *H3: there is a positive correlation between the number of steps and the BMI for men*. Is there a positive correlation? Compare it with the one you found for women.\n",
- "> * We have now gathered the results. Can you find a possible explanation for what you observed? You don't need to come up with a grand theory about mobility and gender, just try to find something (e.g. theory, news, papers, further analysis etc.) to support your conclusions and write down a couple of sentences. \n",
- "\n",
- "> *Exercise 1.3:* scatter plots. We're now going to fully visualize the data from the previous exercise.\n",
- ">\n",
- "> * Create a scatter plot with both data samples. Use `color='#f6756d'` for one sample and `color='#10bdc3'` for the other sample. The data is in front of you, what do you observe? Take a minute to think about these exercises: what do you think the point is? \n",
- " * After answering the questions above, have a look at this [paper](https://genomebiology.biomedcentral.com/track/pdf/10.1186/s13059-020-02133-w.pdf) (in particular, read the *Not all who wander are lost* section).\n",
- "> * The scatter plot made me think of another point we often overlook: *color-vision impairments*. 
When visualizing and explaining data, we need to think about our audience:\n",
- "> * We used the same colors as in the paper, try to save the figure and use any color-blindness simulator you find on the web ([this](https://www.color-blindness.com/coblis-color-blindness-simulator/) was the first that came out in my browser). Are the colors used problematic? Explain why, and try different types of colors. If you are interested in knowing more you can read this [paper](https://www.tandfonline.com/doi/pdf/10.1179/000870403235002042?casa_token=MAYp78HctgQAAAAA:AZKSHJWuNmoMXD5Dtqln1Sc-xjNwCe6UVDMVEpP95UjTH3O1H-NKRkfYljw2VLSm_zKlN74Da6g).\n",
- "> * But, are colors the only option we have? Find an alternative to colors, explain it, and change your scatter plot accordingly."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Part 2: Reading about the theory of visualization\n",
- "\n",
- "Since we can go deeper with the visualization this year, we are going to read the first couple of chapters from [*Data Analysis with Open Source Tools*](http://shop.oreilly.com/product/9780596802363.do) (DAOST). It's pretty old, but I think it's a fantastic resource and one that is pretty much as relevant now as it was back then. The author is a physicist (like Sune) so he likes the way he thinks. And the books takes the reader all the way from visualization, through modeling to computational mining. Anywho - it's a great book and well worth reading in its entirety. \n",
- "\n",
- "As part of this class we'll be reading the first chapters. Today, we'll read chaper 2 (the first 28 pages) which supports and deepens many of the points we made during the video above. \n",
- "\n",
- "To find the text, you will need to go to **DTU Learn**. It's under \"Course content\" $\\rightarrow$ \"Content\" $\\rightarrow$ \"Lecture 3 reading\"."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> *Excercise 2*: Questions for DAOST\n",
- "> * Explain in your own words the point of the jitter plot.\n",
- "> * Explain in your own words the point of figure 2-3. (I'm going to skip saying \"in your own words\" going forward, but I hope you get the point; I expect all answers to be in your own words).\n",
- "> * The author of DAOST (Philipp Janert) likes KDEs (and think they're better than histograms). And we don't. Sune didn't give a detailed explanation in the video, but now that works to our advantage. We'll ask you to think about this and thereby create an excellent exercise: When can KDEs be misleading? \n",
- "> * Sune discussed some strengths of the CDF - there are also weaknesses. Janert writes \"CDFs have less intuitive appeal than histograms of KDEs\". What does he mean by that?\n",
- "> * What is a *Quantile plot*? What is it good for. \n",
- "> * How is a *Probablity plot* defined? What is it useful for? Have you ever seen one before?\n",
- "> * One of the reasons we like DAOST is that Janert is so suspicious of mean, median, and related summary statistics. Explain why one has to be careful when using those - and why visualization of the full data is always better. \n",
- "> * Sune loves box plots (but not enough to own one of [these](https://twitter.com/statisticiann/status/1387454947143426049) 😂). When are box plots most useful?\n",
- "> * The book doesn't mention [violin plots](https://en.wikipedia.org/wiki/Violin_plot). Are those better or worse than box plots? Why?"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Part 3: *Finally*! 
Let's create some visualizations"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "> *Excercise 3.1*: Connecting the dots and recreating plots from DAOST but using our own favorite dataset.\n",
- "> * Let's make a jitter-plot (that is, code up something like **Figure 2-1** from DAOST from scratch), but based on *SF Police data*. My hunch from inspecting the file is that the police-folks might be a little bit lazy in noting down the **exact** time down to the second. So choose a crime-type and a suitable time interval (somewhere between a month and 6 months depending on the crime-type) and create a jitter plot of the arrest times during a single hour (like 13-14, for example). So let time run on the $x$-axis and create vertical jitter.\n",
- "> * Last time, we did lots of bar-plots. Today, we'll play around with histograms (creating two crime-data based versions of the plot-type shown in DAOST **Figure 2-2**). I think the GPS data could be fun to see this way. \n",
- "> * This time, pick two crime-types with different geographical patterns **and** a suitable time-interval for each (you want between 1000 and 10000 points in your histogram)\n",
- "> * Then take the latitude part of the GPS coordinates for each crime and bin the latitudes so that you have around 50 bins across the city of SF. You can use your favorite method for binning. I like `numpy.histogram`. This function gives you the counts and then you do your own plotting. \n",
- "> * Next up is using the plot-type shown in **Figure 2-4** from DAOST, but with the data you used to create Figure 2.1. 
To create the kernel density plot, you can either use `gaussian_kde` from `scipy.stats` ([for an example, check out this stackoverflow post](https://stackoverflow.com/questions/4150171/how-to-create-a-density-plot-in-matplotlib)) or you can use [`seaborn.kdeplot`](https://seaborn.pydata.org/generated/seaborn.kdeplot.html).\n",
- "> * Now grab 25 random timepoints from the dataset (of 1000-10000 original data) you've just plotted and create a version of Figure 2-4 based on the 25 data points. Does this shed light on why I think KDEs can be misleading? \n",
- ">\n",
- "> Let's take a break. Get some coffee or water. Stretch your legs. Talk to your friends for a bit. Breathe. Get relaxed so you're ready for the second part of the exercise. \n",
- "\n",
- "> *Exercise 3.2*. Ok. Now for more plots 😊\n",
- "> * Now we'll work on creating two versions of the plot in **Figure 2-11**, but using the GPS data you used for your version of Figure 2-2. Comment on the result. It is not easy to create this plot from scracth. \n",
- " **Hint:** Take a look at the `scipy.stats.probplot` function. \n",
- "> * OK, we're almost done, but we need some box plots. Here, I'd like you to use the box plots to visualize fluctuations of how many crimes happen per day. We'll use data from the 15 focus crimes defined last week.\n",
- "> * For the full time-span of the data, calulate the **number of crimes per day** within each category for the entire duration of the data.\n",
- "> * Create a box-and whiskers plot showing the mean, median, quantiles, etc for all 15 crime-types side-by-side. There are many ways to do this. 
I like to use [matplotlibs's built in functionality](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html), but you can also achieve good results with [seaborn](https://seaborn.pydata.org/generated/seaborn.boxplot.html) or [pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.boxplot.html).\n",
- "> * What does this plot reveal that you can't see in the plots from last time?\n",
- "> * Also I want to show you guys another interesting use of box plots. To get started, let's calculate another average for each focus-crime, namely what time of day the crime happens. So this time, the distribution we want to plot is the average time-of-day that a crime takes place. There are many ways to do this, but let me describe one way to do it. \n",
- " * For datapoint, the only thing you care about is the time-of-day, so discard everything else.\n",
- " * You also have to deal with the fact that time is annoyingly not divided into nice units that go to 100 like many other numbers. I can think of two ways to deal with this.\n",
- " * For each time-of-day, simply encode it as seconds since midnight.\n",
- " * Or keep each whole hour, and convert the minute/second count to a percentage of an hour. So 10:15 $\\rightarrow$ 10.25, 8:40 $\\rightarrow$ 8.67, etc.\n",
- " * Now you can create box-plots to create an overview of *when various crimes occur*. Note that these plot have quite a different interpretation than ones we created in the previous exercise. Cool, right? 
"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.7.7"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}