diff --git a/img/Intro_to_Databases.png b/img/Intro_to_Databases.png new file mode 100644 index 0000000..a29a9b9 Binary files /dev/null and b/img/Intro_to_Databases.png differ diff --git a/img/omop0.png b/img/omop0.png new file mode 100644 index 0000000..f525729 Binary files /dev/null and b/img/omop0.png differ diff --git a/img/omop1.png b/img/omop1.png index 3664a3e..6664548 100644 Binary files a/img/omop1.png and b/img/omop1.png differ diff --git a/slides/images/Intro_to_Databases.png b/slides/images/Intro_to_Databases.png new file mode 100644 index 0000000..a29a9b9 Binary files /dev/null and b/slides/images/Intro_to_Databases.png differ diff --git a/slides/lesson1_slides.html b/slides/lesson1_slides.html new file mode 100644 index 0000000..b0e4765 --- /dev/null +++ b/slides/lesson1_slides.html @@ -0,0 +1,3788 @@ + + + + + + + + + + + + + W1: Database Concepts, DESCRIBE, SELECT, WHERE + + + + + + + + + + + + + + + +
+
+ +
+

W1: Database Concepts, DESCRIBE, SELECT, WHERE

+ +
+
+ +
+
+

Welcome!

+ +

Please sign up for an account at Posit Cloud and accept our classroom invitation here: https://posit.cloud/spaces/689711/join?access_code=8kse5IYlL4kHIqZvKaQ6mXp8IMibFayMa10I8Izn

+
+
+

Introductions

+
    +
  • Who am I?
  • +
+
+ +
+
+
    +
  • Who are you?

    +
      +
    • Name, pronouns, group you work in

    • +
    • What you want to get out of the class

    • +
    • What has brought you joy lately?

    • +
  • +
+
+
+
    +
  • Our wonderful TAs!
  • +
+
+
+
+

Goals of the course

+
+
    +
  • +
+
+
+
    +
  • +
+
+
+
+

Content of the course

+
    +
  1. Database Concepts, DESCRIBE, SELECT, WHERE

  2. +
  3. JOINing tables

  4. +
  5. [No class week]

  6. +
  7. Calculating new fields, GROUP BY, CASE WHEN, HAVING

  8. +
  9. Subqueries, Views, Pizza

  10. +
+
+
+

Culture of the course

+
+
    +
  • Challenge: We are learning a new language, but you already have a full-time job.
  • +
+
+
+
    +
  • Teach not for mastery, but teach for empowerment to learn effectively.
  • +
+
+
+
    +
  • Teach at learner’s pace.
  • +
+
+
+
+

Culture of the course

+
    +
  • Challenge: We sometimes struggle with our data science problems in isolation, unaware that other folks are working on similar things.
  • +
+
+
    +
  • We learn and work better with our peers.
  • +
+
+
+
    +
  • We encourage discussion and questions, since others often have similar questions.
  • +
+
+
+
+

Format of the course

+
+
    +
  • Hybrid, and recordings will be available.
  • +
+
+
+
    +
  • One-hour exercises after each session are encouraged for practice.
  • +
+
+
+
    +
  • Office hours 11:30-Noon before class.
  • +
+
+
+
+

Badge of completion

+ +

We offer a badge of completion when you finish the course!

+

What it is:

+
    +
  • A display of what you accomplished in the course, shareable in your professional networks such as LinkedIn, similar to online education services such as Coursera.
  • +
+

What it isn’t:

+
    +
  • Accreditation through a university or degree-granting program.
  • +
+
+

Requirements:

+
    +
  • Complete badge-required sections of the exercises for 3 out of 4 assignments.
  • +
+
+
+
+

Databases…

+
    +
  • What are some databases you are interested in?
  • +
+
+
    +
  • Why do we need a Database Management System (DBMS) to manage our data? (What could go wrong in managing a spreadsheet?)
  • +
+
+
+

Benefits of a DBMS:

+
    +
  • Data Integrity: What are the rules within the database? If it is a medical database, does a patient always have a visit site? How do we deal with missing data? Are duplicated entries allowed?
  • +
+
+
+
    +
  • Implementation: How do you find a particular record? What if we now want to create a new application that uses the same database? What if that application is running on a different machine?
  • +
+
+
+
    +
  • Durability: What if the machine crashes while our program is updating a record? What if we want to replicate the database on multiple machines?
  • +
+
+
+
+

A Database Management System (DBMS) consists of

+
    +
  • A user interface - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language).

  • +
  • An execution engine - a software system that queries the data in storage. It can live on our machine, on a server within our network, or on a server in the cloud.

  • +
  • Data Storage - the physical location where the data is stored.

  • +
+
+
+

DBMS examples

+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|  | This class | Example Hutch on-site database system | Example Hutch cloud database system |
|---|---|---|---|
| User Interface | SQL | SQL | SQL |
| Execution Engine | DuckDB | SQL Server | Databricks/Snowflake |
| Data Storage | File on our machine | FH Shared Storage | Amazon S3 Bucket |
+
+
+

Our underlying data model

+

Relational Database: Data is organized into multiple tables. Tables are connected via columns that share the same elements across tables.

+
+

Person table

+ + + + + + + + + + + + + + + + + + + + + + + + + +
| person_id | year_of_birth | gender_source_value |
|-----------|---------------|---------------------|
| 001 | 1/1/1999 | F |
| 002 | 12/31/1999 | F |
| 003 | 6/1/2000 | M |
+
+
+

Procedure Occurrence table

+ + + + + + + + + + + + + + + + + + + + + + + + + +
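| procedure_occurrence_id | person_id | procedure_datetime |
|-------------------------|-----------|--------------------|
| 101 | 001 | 4/1/2010 |
| 102 | 003 | 6/1/2022 |
| 103 | 004 | 5/1/2001 |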
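
To make the idea of "tables connected via shared columns" concrete, here is a minimal sketch that rebuilds toy versions of the two tables above and looks up the same person in both. The `toy_` table names are hypothetical and the values are only the illustrative ones from the slide, not the course data; `SELECT` and `WHERE` are covered later in this lesson.

```sql
-- Toy tables (hypothetical names) holding the illustrative slide values.
CREATE TEMPORARY TABLE toy_person (
  person_id INTEGER,
  gender_source_value VARCHAR
);
INSERT INTO toy_person VALUES (1, 'F'), (2, 'F'), (3, 'M');

CREATE TEMPORARY TABLE toy_procedure_occurrence (
  procedure_occurrence_id INTEGER,
  person_id INTEGER
);
INSERT INTO toy_procedure_occurrence VALUES (101, 1), (102, 3), (103, 4);

-- The shared person_id column is what connects the two tables:
-- find person 3, then find the procedures recorded for person 3.
SELECT * FROM toy_person WHERE person_id = 3;
SELECT * FROM toy_procedure_occurrence WHERE person_id = 3;
```

Note that procedure 103 points at `person_id` 4, which has no row in the toy person table. This is the kind of data integrity question a DBMS can help answer.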
+
+
+

Entity Relationship Diagram

+

+
+
+
+

Let’s get started: connecting to the database

+
+
library(DBI)
+
+con <- DBI::dbConnect(duckdb::duckdb(), "../data/GiBleed_5.3_1.1.duckdb")
+
+
+ +
+
+
+

What are the available tables?

+
+
SHOW TABLES
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
name
care_site
cdm_source
concept
concept_ancestor
concept_class
concept_relationship
concept_synonym
condition_era
condition_occurrence
cost
+
+
+
+
+

Describing a table

+
+
DESCRIBE person
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
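| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| person_id | INTEGER | YES | NA | NA | NA |
| gender_concept_id | INTEGER | YES | NA | NA | NA |
| year_of_birth | INTEGER | YES | NA | NA | NA |
| month_of_birth | INTEGER | YES | NA | NA | NA |
| day_of_birth | INTEGER | YES | NA | NA | NA |
| birth_datetime | TIMESTAMP | YES | NA | NA | NA |
| race_concept_id | INTEGER | YES | NA | NA | NA |
| ethnicity_concept_id | INTEGER | YES | NA | NA | NA |
| location_id | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |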
+
+
+
+
+

Data Types

+

If you look at the column_type column in the DESCRIBE output above, you’ll notice there are different data types:

+
    +
  • INTEGER
  • +
  • TIMESTAMP
  • +
  • DATE
  • +
  • VARCHAR
  • +
+

You can see all of the datatypes that are available in DuckDB here.
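
The data type of a column determines which operations make sense on it. As a small sketch, assuming the same connection and `person` table used in this lesson, the query below reports the type of `birth_datetime` and adds five days to it. `typeof()` and the `INTERVAL 5 DAY` syntax are DuckDB features; other engines may spell these differently.

```sql
-- Assumes the lesson's person table; DuckDB syntax.
SELECT birth_datetime,
       typeof(birth_datetime)          AS column_type,      -- the column's data type
       birth_datetime + INTERVAL 5 DAY AS five_days_later   -- date arithmetic
  FROM person
  LIMIT 5;
```

The same arithmetic on a `VARCHAR` column would not be meaningful, which is why every column needs a declared type.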

+
+
+

SELECT and FROM

+

SELECT is a clause that lets you pick out columns of interest. If you want all columns, use *.

+

FROM is a clause that lets you decide which table to work with.

+
+
+
SELECT * 
+  FROM person 
+  LIMIT 10;
+
+
+
+ + ++++++++++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
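| person_id | gender_concept_id | year_of_birth | month_of_birth | day_of_birth | birth_datetime | race_concept_id | ethnicity_concept_id | location_id | provider_id | care_site_id | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | race_source_concept_id | ethnicity_source_value | ethnicity_source_concept_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 8532 | 1963 | 12 | 31 | 1963-12-31 | 8516 | 0 | NA | NA | NA | 001f4a87-70d0-435c-a4b9-1425f6928d33 | F | 0 | black | 0 | west_indian | 0 |
| 123 | 8507 | 1950 | 4 | 12 | 1950-04-12 | 8527 | 0 | NA | NA | NA | 052d9254-80e8-428f-b8b6-69518b0ef3f3 | M | 0 | white | 0 | italian | 0 |
| 129 | 8507 | 1974 | 10 | 7 | 1974-10-07 | 8527 | 0 | NA | NA | NA | 054d32d5-904f-4df4-846b-8c08d165b4e9 | M | 0 | white | 0 | polish | 0 |
| 16 | 8532 | 1971 | 10 | 13 | 1971-10-13 | 8527 | 0 | NA | NA | NA | 00444703-f2c9-45c9-a247-f6317a43a929 | F | 0 | white | 0 | american | 0 |
| 65 | 8532 | 1967 | 3 | 31 | 1967-03-31 | 8516 | 0 | NA | NA | NA | 02a3dad9-f9d5-42fb-8074-c16d45b4f5c8 | F | 0 | black | 0 | dominican | 0 |
| 74 | 8532 | 1972 | 1 | 5 | 1972-01-05 | 8527 | 0 | NA | NA | NA | 02fbf1be-29b7-4da8-8bbd-14c7433f843f | F | 0 | white | 0 | english | 0 |
| 42 | 8532 | 1909 | 11 | 2 | 1909-11-02 | 8527 | 0 | NA | NA | NA | 0177d2e0-98f5-4f3d-bcfd-497b7a07b3f8 | F | 0 | white | 0 | irish | 0 |
| 187 | 8507 | 1945 | 7 | 23 | 1945-07-23 | 8527 | 0 | NA | NA | NA | 07a1e14d-73ed-4d3a-9a39-d729745773fa | M | 0 | white | 0 | irish | 0 |
| 18 | 8532 | 1965 | 11 | 17 | 1965-11-17 | 8527 | 0 | NA | NA | NA | 0084b0fe-e30f-4930-b6d1-5e1eff4b7dea | F | 0 | white | 0 | english | 0 |
| 111 | 8532 | 1975 | 5 | 2 | 1975-05-02 | 8527 | 0 | NA | NA | NA | 0478d6b3-bdb3-4574-9b93-cf448d725b84 | F | 0 | white | 0 | english | 0 |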
+
+
+
+
+

LIMIT n lets you look at the first n entries.

+

We put multiple SQL clauses together to form a query.

+
+
+

Try it out yourself on the procedure_occurrence table. Why is there a person_id column in this table as well?

+
+
+
+

SELECT for specific columns

+

Instead of * for all columns, we can specify the columns of interest:

+
+
SELECT person_id, birth_datetime, gender_concept_id 
+  FROM person
+  LIMIT 10;
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
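| person_id | birth_datetime | gender_concept_id |
|---|---|---|
| 6 | 1963-12-31 | 8532 |
| 123 | 1950-04-12 | 8507 |
| 129 | 1974-10-07 | 8507 |
| 16 | 1971-10-13 | 8532 |
| 65 | 1967-03-31 | 8532 |
| 74 | 1972-01-05 | 8532 |
| 42 | 1909-11-02 | 8532 |
| 187 | 1945-07-23 | 8507 |
| 18 | 1965-11-17 | 8532 |
| 111 | 1975-05-02 | 8532 |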
+
+
+
+

Try adding race_concept_id and year_of_birth to your SELECT query.

+
+
+
+

WHERE - filtering our table

+

Adding WHERE to our SQL statement lets us add filtering to our query:

+
+
SELECT person_id, gender_source_value, race_source_value, year_of_birth 
+  FROM person 
+  WHERE year_of_birth < 2000
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
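| person_id | gender_source_value | race_source_value | year_of_birth |
|---|---|---|---|
| 6 | F | black | 1963 |
| 123 | M | white | 1950 |
| 129 | M | white | 1974 |
| 16 | F | white | 1971 |
| 65 | F | black | 1967 |
| 74 | F | white | 1972 |
| 42 | F | white | 1909 |
| 187 | M | white | 1945 |
| 18 | F | white | 1965 |
| 111 | F | white | 1975 |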
+
+
+
+

You don’t need to include the columns you’re filtering via WHERE in the SELECT part of the statement:

+
+
SELECT person_id, gender_source_value, race_source_value 
+  FROM person 
+  WHERE year_of_birth < 2000
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
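| person_id | gender_source_value | race_source_value |
|---|---|---|
| 6 | F | black |
| 123 | M | white |
| 129 | M | white |
| 16 | F | white |
| 65 | F | black |
| 74 | F | white |
| 42 | F | white |
| 187 | M | white |
| 18 | F | white |
| 111 | F | white |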
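
The clauses seen so far combine in a fixed order: `SELECT`, then `FROM`, then `WHERE`, then `LIMIT`. A short sketch against the same `person` table, using a different comparison operator (`>=`) and a row limit chosen only for illustration:

```sql
-- Filter first (WHERE), then cap the number of rows returned (LIMIT).
SELECT person_id, gender_source_value, year_of_birth
  FROM person
  WHERE year_of_birth >= 1970
  LIMIT 5;
```

Without `LIMIT`, the query returns every matching row; the slides show only the first ten because of how the output is displayed.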
+
+
+
+
+
+

Single quotes and WHERE

+

Single quotes ('M') refer to values, and double quotes ("person_id") refer to columns.

+

This will trip you up several times if you’re not used to it.

+
+
SELECT person_id, gender_source_value
+  FROM person 
+  WHERE gender_source_value = 'M'
+  LIMIT 10;
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
| person_id | gender_source_value |
|---|---|
| 123 | M |
| 129 | M |
| 187 | M |
| 40 | M |
| 53 | M |
| 78 | M |
| 69 | M |
| 248 | M |
| 105 | M |
| 49 | M |
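
To see both kinds of quotes side by side, the sketch below writes the same query with the column names in double quotes. It should return the same rows as the unquoted version, since double quotes only mark a name as an identifier.

```sql
-- Double quotes mark identifiers (column names); single quotes mark values.
SELECT "person_id", "gender_source_value"
  FROM person
  WHERE "gender_source_value" = 'M'
  LIMIT 10;

-- By contrast, writing = "M" would ask the engine to compare against
-- a column named M, which does not exist in person.
```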
+
+
+
+
+

COUNT - how many entries?

+

Sometimes you want to know the size of your result, not necessarily return the entire set of results. That is what COUNT is for.

+
+
SELECT COUNT(*)
+  FROM procedure_occurrence;
+
+
+
+ + + + + + + + + + + + +
1 records
count_star()
37409
+
+
+
+

Similarly, when we want to count the number of procedure_concept_ids returned, we can use COUNT(procedure_concept_id):

+
+
SELECT COUNT(procedure_concept_id)
+  FROM procedure_occurrence;
+
+
+
+ + + + + + + + + + + + +
1 records
count(procedure_concept_id)
37409
+
+
+
+
+
+

COUNT DISTINCT for unique entries

+

When you have repeated values, COUNT(DISTINCT column_name) can help you find the number of unique values in a column:

+
+
SELECT COUNT(DISTINCT procedure_concept_id)
+  FROM procedure_occurrence
+
+
+
+ + + + + + + + + + + + +
1 records
count(DISTINCT procedure_concept_id)
51
+
+
+
+

We can also return the actual DISTINCT values by removing COUNT:

+
+
SELECT DISTINCT procedure_concept_id
+  FROM procedure_occurrence;
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
procedure_concept_id
4058899
4295880
4216130
4024289
4202451
4330583
4238715
4186930
4242997
4047491
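
The three forms of `COUNT` answer slightly different questions. The following self-contained sketch uses a tiny inline table (the alias `t` and column `x` are made up for illustration) so the differences are easy to check by hand:

```sql
-- Four rows: 1, 1, NULL, 2
SELECT COUNT(*)          AS n_rows,      -- counts every row: 4
       COUNT(x)          AS n_non_null,  -- skips NULL values: 3
       COUNT(DISTINCT x) AS n_unique     -- unique non-NULL values: 2
  FROM (VALUES (1), (1), (NULL), (2)) AS t(x);
```

In `procedure_occurrence`, `COUNT(*)` and `COUNT(procedure_concept_id)` both return 37409, which indicates that the column has no missing values, while `COUNT(DISTINCT procedure_concept_id)` returns 51 because many rows share the same concept.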
+
+
+
+
+

Your turn: Count the distinct values of gender_source_value in person.

+
+
+
+

Revisiting DESCRIBE

+

One of the important properties of data in a relational database is that a table contains no repeated rows. Each table that meets this restriction has what is called a primary key: a column (or set of columns) whose values identify each row.

+
+
DESCRIBE person
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Displaying records 1 - 10
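| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| person_id | INTEGER | YES | NA | NA | NA |
| gender_concept_id | INTEGER | YES | NA | NA | NA |
| year_of_birth | INTEGER | YES | NA | NA | NA |
| month_of_birth | INTEGER | YES | NA | NA | NA |
| day_of_birth | INTEGER | YES | NA | NA | NA |
| birth_datetime | TIMESTAMP | YES | NA | NA | NA |
| race_concept_id | INTEGER | YES | NA | NA | NA |
| ethnicity_concept_id | INTEGER | YES | NA | NA | NA |
| location_id | INTEGER | YES | NA | NA | NA |
| provider_id | INTEGER | YES | NA | NA | NA |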
+
+
+
+

We'll see that primary keys need to be unique (so they can map to each row).
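
The key column in the DESCRIBE output above shows NA for every column, so this file does not appear to declare a primary key explicitly. One way to check whether a column could serve as one, using only clauses from this lesson, is to compare the row count with the number of distinct values. This is a sketch of the idea, not a formal constraint check:

```sql
-- If the two numbers match (and the column has no missing values),
-- every row has its own person_id.
SELECT COUNT(*)                  AS n_rows,
       COUNT(DISTINCT person_id) AS n_unique_ids
  FROM person;
```

Running the same comparison on `person_id` in `procedure_occurrence` should give different numbers, since one person can have many procedures.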

+
+
+
+

Always close the connection

+

When we’re done, it’s best to close the connection with dbDisconnect().

+
+
dbDisconnect(con)
+
+
+ +
+
+ +
+
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/slides/lesson1_slides.qmd b/slides/lesson1_slides.qmd new file mode 100644 index 0000000..2ff6c3e --- /dev/null +++ b/slides/lesson1_slides.qmd @@ -0,0 +1,357 @@ +--- +title: "W1: Database Concepts, DESCRIBE, SELECT, WHERE" +format: + revealjs: + smaller: true + scrollable: true + echo: true + embed-resources: true +output-location: fragment +--- + +## Welcome! + +![](images/Intro_to_Databases.png){width="400"} + +Please [sign-up for an account at Posit Cloud](https://login.posit.cloud/register "https://login.posit.cloud/register") and accept our classroom invitation here: + +## Introductions + +- Who am I? + +. . . + +- What is [DaSL](https://hutchdatascience.org/) / [OCDO](https://ocdo.fredhutch.org/) ? + +. . . + +- Who are you? + + - Name, pronouns, group you work in + + - What you want to get out of the class + + - What has brought you joy lately? + +. . . + +- Our wonderful TAs! + +## Goals of the course + +. . . + +- + +. . . + +- + +## Content of the course + +1. Database Concepts, `DESCRIBE`, `SELECT`, `WHERE` + +2. `JOIN`ing tables + +3. \[No class week\] + +4. Calculating new fields, `GROUP BY`, `CASE WHEN`, `HAVING` + +5. Subqueries, Views, **Pizza** + +## Culture of the course + +. . . + +- Challenge: We are learning a new language, but you already have a full-time job. + +. . . + +- *Teach not for mastery, but teach for empowerment to learn effectively.* + +. . . + +- *Teach at learner's pace.* + +## Culture of the course + +- Challenge: We sometimes struggle with our data science problems in isolation, unaware that other folks are working on similar things. + +. . . + +- *We learn and work better with our peers.* + +. . . + +- *We encourage discussion and questions, as others often have similar questions also.* + +## Format of the course + +. . . + +- Hybrid, and recordings will be available. + +. . . 
+ +- 1 hour exercises after each session are encouraged for practice. + +. . . + +- Office hours 11:30-Noon before class. + +## Badge of completion + +![](images/Intro_to_Databases.png){width="400"} + +We offer a [badge of completion](https://www.credly.com/org/fred-hutch/badge/intro-to-sql) when you finish the course! + +What it is: + +- A display of what you accomplished in the course, shareable in your professional networks such as LinkedIn, similar to online education services such as Coursera. + +What it isn't: + +- Accreditation through an university or degree-granting program. + +. . . + +Requirements: + +- Complete badge-required sections of the exercises for 3 out of 4 assignments. + +## Databases... + +- What are some Databases you are interested in? + +. . . + +- Why do we need a Database Management System (DBMS) to manage it? (What could go wrong in managing a spreadsheet?) + +. . . + +Benefits of a DBMS: + +- **Data Integrity:** What are the rules within the database? If it is a medical database, does a patient always have a visit site? How do we deal with missing data? Are duplicated entries allowed? + +. . . + +- **Implementation:** How do you find a particular record? What if we now want to create a new application that uses the same database? What if that application is running on a different machine? + +. . . + +- **Durability:** What if the machine crashes while our program is updating a record? What if we want to replicate the database on multiple machines? + +## Database Management System (DBMS) consists of + +- **A user interface** - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language). + +- **An execution engine** - a software system that queries the data in storage. These can live on our machine, on a server within our network, or a server on the cloud. + +- **Data Storage** - the physical location where the data is stored. 
+ +## DBMS examples + +| | This class | Example Hutch on-site database system | Example Hutch cloud database system | +|-----------------|-----------------|---------------------|-------------------| +| **User Interface** | SQL | SQL | SQL | +| **Execution Engine** | DuckDB | SQL Server | Databrick/Snowflake | +| **Data Storage** | File on our machine | FH Shared Storage | Amazon S3 Bucket | + +## Our underlying data model + +Relational Database: Data is organized into multiple tables. Tables are connected via columns that share the same elements across tables. + +. . . + +Person table + +| person_id | year_of_birth | gender_source_value | +|-----------|---------------|---------------------| +| 001 | 1/1/1999 | F | +| 002 | 12/31/1999 | F | +| 003 | 6/1/2000 | M | + +. . . + +Procedure Occurrence table + +| procedure_occurrence_id | person_id | procedure_datetime | +|-------------------------|-----------|--------------------| +| 101 | 001 | 4/1/2010 | +| 102 | 003 | 6/1/2022 | +| 103 | 004 | 5/1/2001 | + +. . . + +**Entity Relationship Diagram** + +![](../img/omop0.png){width="550"} + +## Let's get started: connecting to the database + +```{r, warning=FALSE} +library(DBI) + +con <- DBI::dbConnect(duckdb::duckdb(), "../data/GiBleed_5.3_1.1.duckdb") +``` + +## What are the available tables? + +```{sql connection="con"} +SHOW TABLES +``` + +## Describing a table + +```{sql connection="con"} +DESCRIBE person +``` + +## Data Types + +If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types: + +- `INTEGER` +- `TIMESTAMP` +- `DATE` +- `VARCHAR` + +You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html). + +## `SELECT` and `FROM` + +`SELECT` is a clause that lets you pick out columns of interest. If you want all columns, use `*`. + +`FROM` is a clause that lets you decide which table to work with. + +. . . 
+ +```{sql connection="con"} +SELECT * + FROM person + LIMIT 10; +``` + +. . . + +`LIMIT n` let's you look at the first n entries. + +We put multiple SQL **clauses** together to form a **query**. + +. . . + +Try it out yourself on `procedure_occurrence` table. Why is there a `person_id` column in this table as well? + +## `SELECT` for specific columns + +Instead of `*` for all columns, we can specify the columns of interest: + +```{sql connection="con"} +SELECT person_id, birth_datetime, gender_concept_id + FROM person + LIMIT 10; +``` + +. . . + +Try add `race_concept_id` and `year_of_birth` to your `SELECT` query. + +## `WHERE` - filtering our table + +Adding `WHERE` to our SQL statement lets us add filtering to our query: + +```{sql} +#| connection: "con" +SELECT person_id, gender_source_value, race_source_value, year_of_birth + FROM person + WHERE year_of_birth < 2000 +``` + +. . . + +You don't need to include the columns you're filtering via `WHERE` in the `SELECT` part of the statement: + +```{sql} +#| connection: "con" +SELECT person_id, gender_source_value, race_source_value + FROM person + WHERE year_of_birth < 2000 +``` + +## Single quotes and `WHERE` + +Single quotes ('M') refer to values, and double quotes refer to columns ("person_id"). + +This will trip you up several times if you're not used to it. + +```{sql} +#| connection: "con" +SELECT person_id, gender_source_value + FROM person + WHERE gender_source_value = 'M' + LIMIT 10; +``` + +## `COUNT` - how many entries? + +Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for. + +```{sql} +#| connection: "con" +SELECT COUNT(*) + FROM procedure_occurrence; +``` + +. . . 
+ +Similarly, when we want to count the number of `person_id`s returned, we can use `COUNT(person_id)`: + +```{sql} +#| connection: "con" +SELECT COUNT(procedure_concept_id) + FROM procedure_occurrence; +``` + +## `COUNT DISTINCT` for unique entries + +When you have repeated values, `COUNT(DISTINCT )` can help you find the number of unique values in a column: + +```{sql} +#| connection: "con" +SELECT COUNT(DISTINCT procedure_concept_id) + FROM procedure_occurrence +``` + +. . . + +We can also return the actual `DISTINCT` values by removing `COUNT`: + +```{sql} +#| connection: "con" +SELECT DISTINCT procedure_concept_id + FROM procedure_occurrence; +``` + +. . . + +Your turn: Count the distinct values of `gender_source_value` in `person.` + +## Revisiting `DESCRIBE` + +One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*. + +```{sql connection="con"} +DESCRIBE person +``` + +. . . + +We\'ll see that primary keys need to be unique (so they can map to each row). + +## Always close the connection + +When we're done, it's best to close the connection with `dbDisconnect()`. + +```{r} +dbDisconnect(con) +``` diff --git a/week1-exercises.qmd b/week1-exercises.qmd index c8cb6a6..0948a1f 100644 --- a/week1-exercises.qmd +++ b/week1-exercises.qmd @@ -11,7 +11,6 @@ We'll first connect to the database: #| context: setup library(duckdb) library(DBI) -library(DiagrammeR) con <- DBI::dbConnect(duckdb::duckdb(), "data/synthea-smaller_breast_cancer.db") @@ -32,17 +31,16 @@ Show the first 10 rows of `person`. ```{sql} #| connection: "con" -SELECT * FROM - --------- +SELECT ------ + FROM --------- LIMIT ----; ``` - How many people (or rows) are in the person table? ```{sql} #| connection: "con" -SELECT COUNT(-) FROM person; +SELECT COUNT(----) FROM person; ``` How many people are born after 1980? 
@@ -50,7 +48,8 @@ How many people are born after 1980? ```{sql} #| connection: "con" #| eval: false -SELECT COUNT(person_id) FROM person +SELECT COUNT(person_id) + FROM person WHERE year_of_birth -------; ``` @@ -59,8 +58,9 @@ How about how many people who have `gender_source_value` of 'M'? (Hint: remember ```{sql} #| connection: "con" #| eval: false -SELECT COUNT(person_id) FROM person - WHERE gender_source_value = ---- +SELECT COUNT(person_id) + FROM person + WHERE ---- = ---- ``` Ok, we now have a better idea of what is in the `person` table. Let's take a deeper dive into the `concept` table. @@ -71,7 +71,7 @@ Ok, we now have a better idea of what is in the `person` table. Let's take a dee ```{sql} #| connection: "con" -DESCRIBE concept; +DESCRIBE -----; ``` Select the distinct `domain_id`s from the `concept` table: @@ -86,7 +86,8 @@ Return the number of distinct `concept_name`s with `domain_id` equal to `'Proced ```{sql} #| connection: "con" -SELECT COUNT(concept_name) FROM concept +SELECT COUNT(concept_name) + FROM concept WHERE -----------; ``` @@ -107,8 +108,8 @@ How many distinct `procedure_concept_id`s are there in this `procedure_occurrenc ```{sql} #| connection: "con" -SELECT COUNT(DISTINCT procedure_concept_id) - FROM procedure_occurrence; +SELECT COUNT(DISTINCT ----) + FROM ----; ``` @@ -141,4 +142,4 @@ When you're done with your assignment, run the below code chunk to disconnect fr ```{r} dbDisconnect(con) -``` \ No newline at end of file +``` diff --git a/week1.qmd b/week1.qmd index cd9dca6..0743a06 100644 --- a/week1.qmd +++ b/week1.qmd @@ -1,9 +1,9 @@ --- -title: "Week 1: `DESCRIBE`, `SELECT`, `WHERE`" +title: "Week 1: DESCRIBE, SELECT, WHERE" format: html --- -## Our Composable Database System +## Our Database Management System (DBMS) for this course - Client: R/RStudio w/ SQL - Database Engine: DuckDB @@ -18,13 +18,11 @@ To access the data, we need to create a database connection. 
We use `dbConnect() library(duckdb) library(DBI) -con <- DBI::dbConnect(duckdb::duckdb(), - "data/GiBleed_5.3_1.1.duckdb") +con <- DBI::dbConnect(duckdb::duckdb(), "data/GiBleed_5.3_1.1.duckdb") ``` Once open, we can use `con` (our database connection) -::: callout-note ## Keep in Mind: SQL ignores letter case These are the same to the database engine: @@ -38,7 +36,6 @@ select PERSON_ID FROM person; ``` And so on. Our convention is that we capitalize SQL clauses such as `SELECT` so you can differentiate them from other information. -::: ## Looking at the Entire Database @@ -60,10 +57,31 @@ We'll look at a few tables in our work: - `person` - Contains personal & demographic data - `procedure_occurrence` - procedures performed on patients and when they happened -- `condition_occurrence` - patient conditions (such as illnesses) and when they occurred - `concept` - contains the specific information (names of concepts) that map into all three above tables -We'll talk much more later about the relationships between these tables. +## Describing a table + +We can use `DESCRIBE` to get more information (the metadata) about a table. + +```{sql} +#| connection: "con" +DESCRIBE person +``` + +We will pay attention to `column_name` and `column_type` for the moment. + +## Data Types + +If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types: + +- `INTEGER` +- `TIMESTAMP` +- `DATE` +- `VARCHAR` + +Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like `date arithmetic` on `DATETIME` columns, asking the engine to calculate 5 days after the dates. + +You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html). 
## `SELECT` and `FROM` @@ -77,11 +95,11 @@ SELECT * # select all columns ```{sql} #| connection: "con" -SELECT * FROM person LIMIT 10; +SELECT * + FROM person + LIMIT 10; ``` -1. Why are there `birth_datetime` and the `month_of_birth`, `day_of_birth`, `year_of_birth` - aren't these redundant? - ## Try it Out Look at the first few rows of `procedure_occurrence`. @@ -89,7 +107,9 @@ Look at the first few rows of `procedure_occurrence`. ```{sql} #| eval: FALSE #| connection: "con" -SELECT * FROM ____ LIMIT 10; +SELECT * + FROM ____ + LIMIT 10; ``` 1. Why is there a `person_id` column in this table as well? @@ -140,7 +160,7 @@ Adding `WHERE` to our SQL statement lets us add filtering to our query: #| connection: "con" SELECT person_id, gender_source_value, race_source_value, year_of_birth FROM person - WHERE year_of_birth < 1980 + WHERE year_of_birth < 2000 ``` One critical thing to know is that you don't need to include the columns you're filtering on in the `SELECT` part of the statement. For example, we could do the following as well, removing `year_of_birth` from our `SELECT`: @@ -160,7 +180,7 @@ This will trip you up several times if you're not used to it. ```{sql} #| connection: "con" -SELECT person_id, gender_source_value, race_source_value +SELECT person_id, gender_source_value FROM person WHERE gender_source_value = 'M' LIMIT 10; @@ -168,7 +188,6 @@ SELECT person_id, gender_source_value, race_source_value Reminder: use single ('') quotes in your SQL statements to refer to values, not double quotes ("). -::: callout-note ### Quick Note For R users, notice the similarity of `select()` with `SELECT`. We can rewrite the above in `dplyr` code as: @@ -179,39 +198,26 @@ person |> ``` A lot of `dplyr` was inspired by SQL. In fact, there is a package called `dbplyr` that translates `dplyr` statements into SQL. A lot of us use it, and it's pretty handy. -::: -## `COUNT` - how many rows? +## `COUNT` - how many entries? 
Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for. ```{sql} #| connection: "con" -SELECT COUNT(*) - FROM person - WHERE year_of_birth < 2000; +SELECT COUNT(*) + FROM procedure_occurrence; ``` Similarly, when we want to count the number of `person_id`s returned, we can use `COUNT(person_id)`: -```{sql} -#| connection: "con" -SELECT COUNT(person_id) - FROM person - WHERE year_of_birth < 2000; -``` - -Let's switch gears to the `procedure_concept_id` table. Let's count the overall number of `procedure_concept_id`s in our table: - ```{sql} #| connection: "con" SELECT COUNT(procedure_concept_id) FROM procedure_occurrence; ``` -Hmmm. That's quite a lot, but are there repeat `procedure_concept_id`s? - -When you have repeated values in the rows, `COUNT(DISTINCT )` can help you find the number of unique values in a column: +There are repeat `procedure_concept_id`s in the `procedure_occurrence` table. When you have repeated values in the rows, `COUNT(DISTINCT )` can help you find the number of unique values in a column: ```{sql} #| connection: "con" @@ -233,22 +239,20 @@ Count the distinct values of `gender_source_value` in `person`: ```{sql} #| connection: "con" -#| eval: false -SELECT COUNT(DISTINCT --------------) - FROM -------; + ``` -## Keys: Linking tables together +## Revisiting `DESCRIBE` -One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*. - -We can use `DESCRIBE` to get more information (the metadata) about a table. This gives us information about our tables. +Let's return to our table metadata and look at it more in depth: ```{sql} #| connection: "con" DESCRIBE person ``` +One of the important properties of data in a relational database is that there are no *repeat rows* in the database. 
Each table that meets this restriction has what is called a *primary key*. + Scanning the rows, which field/column is the primary key for `person`? Try and find the *primary key* for `procedure_occurrence`. What is it? @@ -258,22 +262,9 @@ Try and find the *primary key* for `procedure_occurrence`. What is it? DESCRIBE procedure_occurrence ``` -We'll see that keys need to be unique (so they can map to each row). In fact, each key is a way to connect one table to another. +We\'ll see that primary keys need to be unique (so they can map to each row). -What column is the same in both tables? That is a hint for what we'll cover next week: `JOIN`ing tables. - -## Data Types - -If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types: - -- `INTEGER` -- `TIMESTAMP` -- `DATE` -- `VARCHAR` - -Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like `date arithmetic` on `DATETIME` columns, asking the engine to calculate 5 days after the dates. - -You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html). +What column is the same in both tables? That is a hint for what we\'ll cover next week: `JOIN`ing tables. ## Always close the connection