A display of what you accomplished in the course, shareable in your professional networks such as LinkedIn, similar to online education services such as Coursera.
+
+
What it isn’t:
+
+
Accreditation through a university or degree-granting program.
+
+
+
Requirements:
+
+
Complete badge-required sections of the exercises for 3 out of 4 assignments.
+
+
+
+
+
Databases…
+
+
What are some Databases you are interested in?
+
+
+
+
Why do we need a Database Management System (DBMS) to manage it? (What could go wrong in managing a spreadsheet?)
+
+
+
+
Benefits of a DBMS:
+
+
Data Integrity: What are the rules within the database? If it is a medical database, does a patient always have a visit site? How do we deal with missing data? Are duplicated entries allowed?
+
+
+
+
+
Implementation: How do you find a particular record? What if we now want to create a new application that uses the same database? What if that application is running on a different machine?
+
+
+
+
+
Durability: What if the machine crashes while our program is updating a record? What if we want to replicate the database on multiple machines?
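The integrity and crash questions above can be made concrete with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for the course's DuckDB setup, and the `visit` table, its columns, and the `try_insert` helper are hypothetical names made up for illustration. It shows a DBMS rejecting a missing visit site and a duplicated entry, and undoing an update that fails partway through.

```python
import sqlite3

# In-memory stand-in database; the `visit` table and its columns are hypothetical.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,  -- rule: no duplicated entries
        person_id  INTEGER NOT NULL,
        visit_site TEXT NOT NULL         -- rule: every visit has a visit site
    )
    """
)
con.execute("INSERT INTO visit VALUES (1, 1, 'Clinic A')")
con.commit()


def try_insert(row):
    """Return True if the database accepts the row, False if a rule rejects it."""
    try:
        with con:  # commits on success, rolls back on error
            con.execute("INSERT INTO visit VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


missing_site_ok = try_insert((2, 1, None))     # violates NOT NULL
duplicate_ok = try_insert((1, 2, "Clinic B"))  # violates PRIMARY KEY
valid_ok = try_insert((3, 2, "Clinic B"))      # satisfies every rule

# An update interrupted partway through a transaction is rolled back.
try:
    with con:
        con.execute("UPDATE visit SET visit_site = 'Clinic C' WHERE visit_id = 1")
        raise RuntimeError("simulated crash mid-update")
except RuntimeError:
    pass

site_after_crash = con.execute(
    "SELECT visit_site FROM visit WHERE visit_id = 1"
).fetchone()[0]
row_count = con.execute("SELECT COUNT(*) FROM visit").fetchone()[0]
print(missing_site_ok, duplicate_ok, valid_ok, site_after_crash, row_count)
```

A spreadsheet would accept all three rows and could keep a half-finished edit. Here only the valid row is stored, and the interrupted update leaves the original visit site in place.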
+
+
+
+
+
Database Management System (DBMS) consists of
+
+
A user interface - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language).
+
An execution engine - a software system that queries the data in storage. These can live on our machine, on a server within our network, or a server on the cloud.
+
Data Storage - the physical location where the data is stored.
+
+
+
+
DBMS examples
+
+
+
+
+
+
+
+
+
+
+
| | This class | Example Hutch on-site database system | Example Hutch cloud database system |
|---|---|---|---|
| User Interface | SQL | SQL | SQL |
| Execution Engine | DuckDB | SQL Server | Databricks/Snowflake |
| Data Storage | File on our machine | FH Shared Storage | Amazon S3 Bucket |
+
+
+
+
+
+
Our underlying data model
+
Relational Database: Data is organized into multiple tables. Tables are connected via columns that share the same elements across tables.
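As a minimal sketch of tables connected through a shared column, the example below builds two toy tables in an in-memory SQLite database (Python's built-in `sqlite3`, used here as a stand-in for DuckDB). The rows are made up for illustration, and only the table and column names mirror the course data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE person (
        person_id           INTEGER PRIMARY KEY,
        year_of_birth       INTEGER,
        gender_source_value TEXT
    );
    CREATE TABLE procedure_occurrence (
        procedure_occurrence_id INTEGER PRIMARY KEY,
        person_id               INTEGER,  -- same values as person.person_id
        procedure_datetime      TEXT
    );
    -- Toy rows, invented for this sketch.
    INSERT INTO person VALUES (1, 1999, 'F'), (2, 1999, 'F'), (3, 2000, 'M');
    INSERT INTO procedure_occurrence VALUES
        (101, 1, '2010-04-01'),
        (102, 3, '2022-06-01'),
        (103, 4, '2001-05-01');
    """
)

# person_id values that appear in both tables: these link a procedure to a person.
shared_ids = [
    row[0]
    for row in con.execute(
        """
        SELECT person_id
          FROM person
          WHERE person_id IN (SELECT person_id FROM procedure_occurrence)
          ORDER BY person_id
        """
    )
]

# Look up the procedures recorded for one person through the shared column.
procedures_for_person_1 = [
    row[0]
    for row in con.execute(
        "SELECT procedure_occurrence_id FROM procedure_occurrence WHERE person_id = 1"
    )
]
print(shared_ids, procedures_for_person_1)
```

Person 2 has no recorded procedure, and the procedure with `person_id` 4 has no matching person, so only `person_id` values 1 and 3 connect the two tables.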
Your turn: Count the distinct values of gender_source_value in person.
+
+
+
+
Revisiting DESCRIBE
+
One of the important properties of data in a relational database is that there are no repeat rows in the database. Each table that meets this restriction has what is called a primary key.
+
+
DESCRIBE person
+
+
+
+
+
Displaying records 1 - 10
+
+
+
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| person_id | INTEGER | YES | NA | NA | NA |
| gender_concept_id | INTEGER | YES | NA | NA | NA |
| year_of_birth | INTEGER | YES | NA | NA | NA |
| month_of_birth | INTEGER | YES | NA | NA | NA |
| day_of_birth | INTEGER | YES | NA | NA | NA |
| birth_datetime | TIMESTAMP | YES | NA | NA | NA |
| race_concept_id | INTEGER | YES | NA | NA | NA |
| ethnicity_concept_id | INTEGER | YES | NA | NA | NA |
| location_id | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |
+
+
+
+
+
+
+
We'll see that primary keys need to be unique (so they can map to each row).
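One way to check whether a column could serve as a primary key is to compare `COUNT(*)` with `COUNT(DISTINCT column)`: if they match, no value repeats. The sketch below assumes a toy `person` table in an in-memory SQLite database (Python's built-in `sqlite3`, a stand-in for DuckDB), and `is_unique` is a hypothetical helper written for this illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE person (person_id INTEGER, gender_source_value TEXT);
    -- Toy rows, invented for this sketch.
    INSERT INTO person VALUES (1, 'F'), (2, 'F'), (3, 'M');
    """
)


def is_unique(table, column):
    """Return True when every row has a different, non-missing value in `column`.

    Table and column names cannot be passed as bound parameters, so they are
    formatted into the query; only use this with trusted, hard-coded names.
    """
    total, distinct = con.execute(
        f'SELECT COUNT(*), COUNT(DISTINCT "{column}") FROM "{table}"'
    ).fetchone()
    return total == distinct


person_id_unique = is_unique("person", "person_id")
gender_unique = is_unique("person", "gender_source_value")
print(person_id_unique, gender_unique)
```

Here `person_id` passes the check because each row has its own value, while `gender_source_value` fails because 'F' repeats, so it cannot identify a single row.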
+
+
+
+
Always close the connection
+
When we’re done, it’s best to close the connection with dbDisconnect().
+
+
dbDisconnect(con)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/slides/lesson1_slides.qmd b/slides/lesson1_slides.qmd
new file mode 100644
index 0000000..2ff6c3e
--- /dev/null
+++ b/slides/lesson1_slides.qmd
@@ -0,0 +1,357 @@
+---
+title: "W1: Database Concepts, DESCRIBE, SELECT, WHERE"
+format:
+ revealjs:
+ smaller: true
+ scrollable: true
+ echo: true
+ embed-resources: true
+output-location: fragment
+---
+
+## Welcome!
+
+{width="400"}
+
+Please [sign up for an account at Posit Cloud](https://login.posit.cloud/register "https://login.posit.cloud/register") and accept our classroom invitation here:
+
+## Introductions
+
+- Who am I?
+
+. . .
+
+- What is [DaSL](https://hutchdatascience.org/) / [OCDO](https://ocdo.fredhutch.org/) ?
+
+. . .
+
+- Who are you?
+
+ - Name, pronouns, group you work in
+
+ - What you want to get out of the class
+
+ - What has brought you joy lately?
+
+. . .
+
+- Our wonderful TAs!
+
+## Goals of the course
+
+. . .
+
+-
+
+. . .
+
+-
+
+## Content of the course
+
+1. Database Concepts, `DESCRIBE`, `SELECT`, `WHERE`
+
+2. `JOIN`ing tables
+
+3. \[No class week\]
+
+4. Calculating new fields, `GROUP BY`, `CASE WHEN`, `HAVING`
+
+5. Subqueries, Views, **Pizza**
+
+## Culture of the course
+
+. . .
+
+- Challenge: We are learning a new language, but you already have a full-time job.
+
+. . .
+
+- *Teach not for mastery, but teach for empowerment to learn effectively.*
+
+. . .
+
+- *Teach at learner's pace.*
+
+## Culture of the course
+
+- Challenge: We sometimes struggle with our data science problems in isolation, unaware that other folks are working on similar things.
+
+. . .
+
+- *We learn and work better with our peers.*
+
+. . .
+
+- *We encourage discussion and questions, since others often have similar questions.*
+
+## Format of the course
+
+. . .
+
+- Hybrid, and recordings will be available.
+
+. . .
+
+- 1-hour exercises after each session are encouraged for practice.
+
+. . .
+
+- Office hours 11:30-Noon before class.
+
+## Badge of completion
+
+{width="400"}
+
+We offer a [badge of completion](https://www.credly.com/org/fred-hutch/badge/intro-to-sql) when you finish the course!
+
+What it is:
+
+- A display of what you accomplished in the course, shareable in your professional networks such as LinkedIn, similar to online education services such as Coursera.
+
+What it isn't:
+
+- Accreditation through a university or degree-granting program.
+
+. . .
+
+Requirements:
+
+- Complete badge-required sections of the exercises for 3 out of 4 assignments.
+
+## Databases...
+
+- What are some Databases you are interested in?
+
+. . .
+
+- Why do we need a Database Management System (DBMS) to manage it? (What could go wrong in managing a spreadsheet?)
+
+. . .
+
+Benefits of a DBMS:
+
+- **Data Integrity:** What are the rules within the database? If it is a medical database, does a patient always have a visit site? How do we deal with missing data? Are duplicated entries allowed?
+
+. . .
+
+- **Implementation:** How do you find a particular record? What if we now want to create a new application that uses the same database? What if that application is running on a different machine?
+
+. . .
+
+- **Durability:** What if the machine crashes while our program is updating a record? What if we want to replicate the database on multiple machines?
+
+## Database Management System (DBMS) consists of
+
+- **A user interface** - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language).
+
+- **An execution engine** - a software system that queries the data in storage. These can live on our machine, on a server within our network, or a server on the cloud.
+
+- **Data Storage** - the physical location where the data is stored.
+
+## DBMS examples
+
+| | This class | Example Hutch on-site database system | Example Hutch cloud database system |
+|-----------------|-----------------|---------------------|-------------------|
+| **User Interface** | SQL | SQL | SQL |
+| **Execution Engine** | DuckDB | SQL Server | Databricks/Snowflake |
+| **Data Storage** | File on our machine | FH Shared Storage | Amazon S3 Bucket |
+
+## Our underlying data model
+
+Relational Database: Data is organized into multiple tables. Tables are connected via columns that share the same elements across tables.
+
+. . .
+
+Person table
+
+| person_id | year_of_birth | gender_source_value |
+|-----------|---------------|---------------------|
+| 001 | 1999 | F |
+| 002 | 1999 | F |
+| 003 | 2000 | M |
+
+. . .
+
+Procedure Occurrence table
+
+| procedure_occurrence_id | person_id | procedure_datetime |
+|-------------------------|-----------|--------------------|
+| 101 | 001 | 4/1/2010 |
+| 102 | 003 | 6/1/2022 |
+| 103 | 004 | 5/1/2001 |
+
+. . .
+
+**Entity Relationship Diagram**
+
+{width="550"}
+
+## Let's get started: connecting to the database
+
+```{r, warning=FALSE}
+library(DBI)
+
+con <- DBI::dbConnect(duckdb::duckdb(), "../data/GiBleed_5.3_1.1.duckdb")
+```
+
+## What are the available tables?
+
+```{sql connection="con"}
+SHOW TABLES
+```
+
+## Describing a table
+
+```{sql connection="con"}
+DESCRIBE person
+```
+
+## Data Types
+
+If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types:
+
+- `INTEGER`
+- `TIMESTAMP`
+- `DATE`
+- `VARCHAR`
+
+You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html).
+
+## `SELECT` and `FROM`
+
+`SELECT` is a clause that lets you pick out columns of interest. If you want all columns, use `*`.
+
+`FROM` is a clause that lets you decide which table to work with.
+
+. . .
+
+```{sql connection="con"}
+SELECT *
+ FROM person
+ LIMIT 10;
+```
+
+. . .
+
+`LIMIT n` lets you look at the first n entries.
+
+We put multiple SQL **clauses** together to form a **query**.
+
+. . .
+
+Try it out yourself on the `procedure_occurrence` table. Why is there a `person_id` column in this table as well?
+
+## `SELECT` for specific columns
+
+Instead of `*` for all columns, we can specify the columns of interest:
+
+```{sql connection="con"}
+SELECT person_id, birth_datetime, gender_concept_id
+ FROM person
+ LIMIT 10;
+```
+
+. . .
+
+Try adding `race_concept_id` and `year_of_birth` to your `SELECT` query.
+
+## `WHERE` - filtering our table
+
+Adding `WHERE` to our SQL statement lets us add filtering to our query:
+
+```{sql}
+#| connection: "con"
+SELECT person_id, gender_source_value, race_source_value, year_of_birth
+ FROM person
+ WHERE year_of_birth < 2000
+```
+
+. . .
+
+You don't need to include the columns you're filtering via `WHERE` in the `SELECT` part of the statement:
+
+```{sql}
+#| connection: "con"
+SELECT person_id, gender_source_value, race_source_value
+ FROM person
+ WHERE year_of_birth < 2000
+```
+
+## Single quotes and `WHERE`
+
+Single quotes ('M') refer to values, and double quotes refer to columns ("person_id").
+
+This will trip you up several times if you're not used to it.
+
+```{sql}
+#| connection: "con"
+SELECT person_id, gender_source_value
+ FROM person
+ WHERE gender_source_value = 'M'
+ LIMIT 10;
+```
+
+## `COUNT` - how many entries?
+
+Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for.
+
+```{sql}
+#| connection: "con"
+SELECT COUNT(*)
+ FROM procedure_occurrence;
+```
+
+. . .
+
+Similarly, when we want to count the number of `procedure_concept_id`s returned, we can use `COUNT(procedure_concept_id)`:
+
+```{sql}
+#| connection: "con"
+SELECT COUNT(procedure_concept_id)
+ FROM procedure_occurrence;
+```
+
+## `COUNT DISTINCT` for unique entries
+
+When you have repeated values, `COUNT(DISTINCT )` can help you find the number of unique values in a column:
+
+```{sql}
+#| connection: "con"
+SELECT COUNT(DISTINCT procedure_concept_id)
+ FROM procedure_occurrence
+```
+
+. . .
+
+We can also return the actual `DISTINCT` values by removing `COUNT`:
+
+```{sql}
+#| connection: "con"
+SELECT DISTINCT procedure_concept_id
+ FROM procedure_occurrence;
+```
+
+. . .
+
+Your turn: Count the distinct values of `gender_source_value` in `person`.
+
+## Revisiting `DESCRIBE`
+
+One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*.
+
+```{sql connection="con"}
+DESCRIBE person
+```
+
+. . .
+
+We'll see that primary keys need to be unique (so they can map to each row).
+
+## Always close the connection
+
+When we're done, it's best to close the connection with `dbDisconnect()`.
+
+```{r}
+dbDisconnect(con)
+```
diff --git a/week1-exercises.qmd b/week1-exercises.qmd
index c8cb6a6..0948a1f 100644
--- a/week1-exercises.qmd
+++ b/week1-exercises.qmd
@@ -11,7 +11,6 @@ We'll first connect to the database:
#| context: setup
library(duckdb)
library(DBI)
-library(DiagrammeR)
con <- DBI::dbConnect(duckdb::duckdb(),
"data/synthea-smaller_breast_cancer.db")
@@ -32,17 +31,16 @@ Show the first 10 rows of `person`.
```{sql}
#| connection: "con"
-SELECT * FROM
- ---------
+SELECT ------
+ FROM ---------
LIMIT ----;
```
-
How many people (or rows) are in the person table?
```{sql}
#| connection: "con"
-SELECT COUNT(-) FROM person;
+SELECT COUNT(----) FROM person;
```
How many people are born after 1980?
@@ -50,7 +48,8 @@ How many people are born after 1980?
```{sql}
#| connection: "con"
#| eval: false
-SELECT COUNT(person_id) FROM person
+SELECT COUNT(person_id)
+ FROM person
WHERE year_of_birth -------;
```
@@ -59,8 +58,9 @@ How about how many people who have `gender_source_value` of 'M'? (Hint: remember
```{sql}
#| connection: "con"
#| eval: false
-SELECT COUNT(person_id) FROM person
- WHERE gender_source_value = ----
+SELECT COUNT(person_id)
+ FROM person
+ WHERE ---- = ----
```
Ok, we now have a better idea of what is in the `person` table. Let's take a deeper dive into the `concept` table.
@@ -71,7 +71,7 @@ Ok, we now have a better idea of what is in the `person` table. Let's take a dee
```{sql}
#| connection: "con"
-DESCRIBE concept;
+DESCRIBE -----;
```
Select the distinct `domain_id`s from the `concept` table:
@@ -86,7 +86,8 @@ Return the number of distinct `concept_name`s with `domain_id` equal to `'Proced
```{sql}
#| connection: "con"
-SELECT COUNT(concept_name) FROM concept
+SELECT COUNT(concept_name)
+ FROM concept
WHERE -----------;
```
@@ -107,8 +108,8 @@ How many distinct `procedure_concept_id`s are there in this `procedure_occurrenc
```{sql}
#| connection: "con"
-SELECT COUNT(DISTINCT procedure_concept_id)
- FROM procedure_occurrence;
+SELECT COUNT(DISTINCT ----)
+ FROM ----;
```
@@ -141,4 +142,4 @@ When you're done with your assignment, run the below code chunk to disconnect fr
```{r}
dbDisconnect(con)
-```
\ No newline at end of file
+```
diff --git a/week1.qmd b/week1.qmd
index cd9dca6..0743a06 100644
--- a/week1.qmd
+++ b/week1.qmd
@@ -1,9 +1,9 @@
---
-title: "Week 1: `DESCRIBE`, `SELECT`, `WHERE`"
+title: "Week 1: DESCRIBE, SELECT, WHERE"
format: html
---
-## Our Composable Database System
+## Our Database Management System (DBMS) for this course
- Client: R/RStudio w/ SQL
- Database Engine: DuckDB
@@ -18,13 +18,11 @@ To access the data, we need to create a database connection. We use `dbConnect()
library(duckdb)
library(DBI)
-con <- DBI::dbConnect(duckdb::duckdb(),
- "data/GiBleed_5.3_1.1.duckdb")
+con <- DBI::dbConnect(duckdb::duckdb(), "data/GiBleed_5.3_1.1.duckdb")
```
Once open, we can use `con` (our database connection)
-::: callout-note
## Keep in Mind: SQL ignores letter case
These are the same to the database engine:
@@ -38,7 +36,6 @@ select PERSON_ID FROM person;
```
And so on. Our convention is that we capitalize SQL clauses such as `SELECT` so you can differentiate them from other information.
-:::
## Looking at the Entire Database
@@ -60,10 +57,31 @@ We'll look at a few tables in our work:
- `person` - Contains personal & demographic data
- `procedure_occurrence` - procedures performed on patients and when they happened
-- `condition_occurrence` - patient conditions (such as illnesses) and when they occurred
- `concept` - contains the specific information (names of concepts) that map into all three above tables
-We'll talk much more later about the relationships between these tables.
+## Describing a table
+
+We can use `DESCRIBE` to get more information (the metadata) about a table.
+
+```{sql}
+#| connection: "con"
+DESCRIBE person
+```
+
+We will pay attention to `column_name` and `column_type` for the moment.
+
+## Data Types
+
+If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types:
+
+- `INTEGER`
+- `TIMESTAMP`
+- `DATE`
+- `VARCHAR`
+
+Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on it. For example, we can do date arithmetic on `DATE` and `TIMESTAMP` columns, such as asking the engine to calculate the date 5 days after each value.
+
+You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html).
## `SELECT` and `FROM`
@@ -77,11 +95,11 @@ SELECT * # select all columns
```{sql}
#| connection: "con"
-SELECT * FROM person LIMIT 10;
+SELECT *
+ FROM person
+ LIMIT 10;
```
-1. Why are there `birth_datetime` and the `month_of_birth`, `day_of_birth`, `year_of_birth` - aren't these redundant?
-
## Try it Out
Look at the first few rows of `procedure_occurrence`.
@@ -89,7 +107,9 @@ Look at the first few rows of `procedure_occurrence`.
```{sql}
#| eval: FALSE
#| connection: "con"
-SELECT * FROM ____ LIMIT 10;
+SELECT *
+ FROM ____
+ LIMIT 10;
```
1. Why is there a `person_id` column in this table as well?
@@ -140,7 +160,7 @@ Adding `WHERE` to our SQL statement lets us add filtering to our query:
#| connection: "con"
SELECT person_id, gender_source_value, race_source_value, year_of_birth
FROM person
- WHERE year_of_birth < 1980
+ WHERE year_of_birth < 2000
```
One critical thing to know is that you don't need to include the columns you're filtering on in the `SELECT` part of the statement. For example, we could do the following as well, removing `year_of_birth` from our `SELECT`:
@@ -160,7 +180,7 @@ This will trip you up several times if you're not used to it.
```{sql}
#| connection: "con"
-SELECT person_id, gender_source_value, race_source_value
+SELECT person_id, gender_source_value
FROM person
WHERE gender_source_value = 'M'
LIMIT 10;
@@ -168,7 +188,6 @@ SELECT person_id, gender_source_value, race_source_value
Reminder: use single ('') quotes in your SQL statements to refer to values, not double quotes (").
-::: callout-note
### Quick Note
For R users, notice the similarity of `select()` with `SELECT`. We can rewrite the above in `dplyr` code as:
@@ -179,39 +198,26 @@ person |>
```
A lot of `dplyr` was inspired by SQL. In fact, there is a package called `dbplyr` that translates `dplyr` statements into SQL. A lot of us use it, and it's pretty handy.
-:::
-## `COUNT` - how many rows?
+## `COUNT` - how many entries?
Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for.
```{sql}
#| connection: "con"
-SELECT COUNT(*)
- FROM person
- WHERE year_of_birth < 2000;
+SELECT COUNT(*)
+ FROM procedure_occurrence;
```
Similarly, when we want to count the number of `person_id`s returned, we can use `COUNT(person_id)`:
-```{sql}
-#| connection: "con"
-SELECT COUNT(person_id)
- FROM person
- WHERE year_of_birth < 2000;
-```
-
-Let's switch gears to the `procedure_concept_id` table. Let's count the overall number of `procedure_concept_id`s in our table:
-
```{sql}
#| connection: "con"
SELECT COUNT(procedure_concept_id)
FROM procedure_occurrence;
```
-Hmmm. That's quite a lot, but are there repeat `procedure_concept_id`s?
-
-When you have repeated values in the rows, `COUNT(DISTINCT )` can help you find the number of unique values in a column:
+There are repeat `procedure_concept_id`s in the `procedure_occurrence` table. When you have repeated values in the rows, `COUNT(DISTINCT )` can help you find the number of unique values in a column:
```{sql}
#| connection: "con"
@@ -233,22 +239,20 @@ Count the distinct values of `gender_source_value` in `person`:
```{sql}
#| connection: "con"
-#| eval: false
-SELECT COUNT(DISTINCT --------------)
- FROM -------;
+
```
-## Keys: Linking tables together
+## Revisiting `DESCRIBE`
-One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*.
-
-We can use `DESCRIBE` to get more information (the metadata) about a table. This gives us information about our tables.
+Let's return to our table metadata and look at it more in depth:
```{sql}
#| connection: "con"
DESCRIBE person
```
+One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*.
+
Scanning the rows, which field/column is the primary key for `person`?
Try and find the *primary key* for `procedure_occurrence`. What is it?
@@ -258,22 +262,9 @@ Try and find the *primary key* for `procedure_occurrence`. What is it?
DESCRIBE procedure_occurrence
```
-We'll see that keys need to be unique (so they can map to each row). In fact, each key is a way to connect one table to another.
+We'll see that primary keys need to be unique (so they can map to each row).
-What column is the same in both tables? That is a hint for what we'll cover next week: `JOIN`ing tables.
-
-## Data Types
-
-If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types:
-
-- `INTEGER`
-- `TIMESTAMP`
-- `DATE`
-- `VARCHAR`
-
-Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like `date arithmetic` on `DATETIME` columns, asking the engine to calculate 5 days after the dates.
-
-You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html).
+What column is the same in both tables? That is a hint for what we'll cover next week: `JOIN`ing tables.
## Always close the connection