diff --git a/concepts.qmd b/concepts.qmd index 03c1966..99bdc7c 100644 --- a/concepts.qmd +++ b/concepts.qmd @@ -21,19 +21,17 @@ con <- DBI::dbConnect(duckdb::duckdb(), > A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS). Together, the data and the DBMS, along with the applications that are associated with them, are referred to as a database system, often shortened to just database. - [Oracle Documentation](https://www.oracle.com/database/what-is-database/) -When we talk about databases, we mean the *database system* rather than database itself. Specifically, we talk about the different layers of a database system. +When we talk about databases, we often mean the *database system* rather than database itself. Specifically, we talk about the different layers of a database system. ## Parts of a Database System The [Composable Codex](https://voltrondata.com/codex/a-new-frontier#structure-of-a-composable-data-system) talks about three layers of a database system: +![](img/composable-data-system-modules.png) [From the Composable Codex](https://voltrondata.com/codex/a-new-frontier#building-a-new-composable-frontier) -![](img/composable-data-system-modules.png) -[From the Composable Codex](https://voltrondata.com/codex/a-new-frontier#building-a-new-composable-frontier) - -1. **A user interface** - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language). -2. **An execution engine** - a software system that queries the data in storage. There are many examples of this: SQL Server, MariaDB, DuckDB, Snowflake. These can live on our machine, on a server within our network, or a server on the cloud. -3. **Data Storage** - the physical location where the data is stored. 
This could be on your computer, on the network, or in the cloud (such as an Amazon S3 bucket) +1. **A user interface** - how users interact with the database. In this class, our main way of interacting with databases is SQL (Structured Query Language). +2. **An execution engine** - a software system that queries the data in storage. There are many examples of this: SQL Server, MariaDB, DuckDB, Snowflake. These can live on our machine, on a server within our network, or a server on the cloud. +3. **Data Storage** - the physical location where the data is stored. This could be on your computer, on the network, or in the cloud (such as an Amazon S3 bucket) ## For this class @@ -46,7 +44,7 @@ B["2.DuckDB"] --> C C["3.File on our Machine"] ``` -::: {.callout} +::: callout ## Why We're Using DuckDB in this Course DuckDB is a very fast, open-source database engine. Because of restrictions on clinical data, sometimes the only way to analyze it is on an approved laptop. DuckDB does wondrous things on laptops, so we hope it will be a helpful tool in your arsenal. @@ -74,18 +72,30 @@ B["2.Databricks/Snowflake"] --> C C["3.Amazon S3"] ``` -In this case, we need to sign into the Databricks system, which is a set of systems that lives in the cloud. We actually will use SQL within their notebooks to write our queries. Databricks will then use the Snowflake engine to query the data that is stored in cloud storage (an S3 bucket). +In this case, we need to sign into the Databricks system, which is a set of systems that lives in the cloud. We actually will use SQL within their notebooks to write our queries. Databricks will then use the Snowflake engine to query the data that is stored in cloud storage (an S3 bucket). If this is making you dizzy, don't worry too much about it. Just know that we can switch out the different layers based on our needs. +## Our underlying data model + +The three components of our Database System depend on our choice of data model. 
Most data models are centered around **Relational Databases**. A relational database organizes data into multiple tables. Each row of a table is a record with a unique ID called the key, as well as attributes described in the columns. The tables may relate to each other based on columns with the same values. + +Below is an example **entity-relationship diagram** that summarizes relationships between tables: + +![](img/omop1.png) + +Each rectangle represents a table, and within each table are the columns (fields). The connecting lines show that there are shared values between tables in those columns, which helps one navigate between tables. Don't worry if this feels foreign to you right now - we will unpack these diagrams throughout the course. + +Other data models include **NoSQL ("Not Only SQL")**, which allows the organization of unstructured data via key-value pairs, graphs, and entire documents. Another emerging family of data models is **Array/Matrix/Vector-based models**, which are largely focused on organizing numerical data for machine learning purposes. + ## What is SQL? -SQL is short for **S**tructured **Q**uery **L**anguage. It is a standardized language for querying databases (originally relational databases) +SQL is short for **S**tructured **Q**uery **L**anguage. It is a standardized language for querying relational databases. SQL lets us do various operations on data. It contains various *clauses* which let us manipulate data: | Priority | Clause | Purpose | -| -------- | ---------- | -------------------------------------------------------------- | +|------------|------------|------------------------------------------------| | 1 | `FROM` | Choose tables to query and specify how to `JOIN` them together | | 2 | `WHERE` | Filter tables based on criteria | | 3 | `GROUP BY` | Aggregates the Data | @@ -99,7 +109,7 @@ We do not use all of these clauses when we write a SQL Query. We only use the on Oftentimes, we really only want a summary out of the database. 
We would probably use the following clauses: | Priority | Clause | Purpose | -| -------- | ---------- | -------------------------------------------------------------- | +|------------|------------|------------------------------------------------| | 1 | `FROM` | Choose tables to query and specify how to `JOIN` them together | | 2 | `WHERE` | Filter tables based on criteria | | 3 | `GROUP BY` | Aggregates the Data | @@ -107,7 +117,7 @@ Oftentimes, we really only want a summary out of the database. We would probably Notice that there is a **Priority** column in these tables. This is important, because parts of queries are evaluated in this order. -::: {.callout-note} +::: callout-note ## Dialects of SQL You may have heard that the SQL used in SQL Server is different than other databases. In truth, there are multiple dialects of SQL, based on the engine. @@ -119,7 +129,7 @@ However, we're focusing on the 95% of SQL that is common to all systems. Most of Let's look at a typical SQL statement: -```sql +``` sql SELECT person_id, gender_source_value # Choose Columns FROM person # Choose the person table WHERE year_of_birth < 2000; # Filter the data using a criterion @@ -127,7 +137,7 @@ SELECT person_id, gender_source_value # Choose Columns We can read this as: -``` +``` SELECT the person_id and gender_source_value columns FROM the person table ONLY Those with year of birth less than 2000 @@ -135,17 +145,17 @@ ONLY Those with year of birth less than 2000 As you can see, SQL can be read. We will gradually introduce clauses and different database operations. -::: {.callout-note} +::: callout-note As a convention, we will capitalize SQL clauses (such as `SELECT`), and use lowercase for everything else. ::: ## Database Connections -We haven't really talked about how we *connect* to the database engine. +We haven't really talked about how we *connect* to the database engine. 
In order to connect to the database engine and create a database connection, we may have to authenticate with an ID/password combo or use other methods of authentication to prove who we are. -Once we are authenticated, we now have a connection. This is basically our conduit to the database engine. We can *send* queries through it, and the database engine will run these queries, and **return** a result. +Once we are authenticated, we now have a connection. This is basically our conduit to the database engine. We can *send* queries through it, and the database engine will run these queries, and **return** a result. ```{mermaid} graph LR @@ -155,7 +165,7 @@ graph LR As long as the connection is open, we can continue to send queries and receive results. -It is best practice to explicitly **disconnect** from the database. Once we have disconnected, we no longer have access to the database. +It is best practice to explicitly **disconnect** from the database. Once we have disconnected, we no longer have access to the database. ```{mermaid} graph LR @@ -175,20 +185,20 @@ SELECT * FROM person LIMIT 10; Some quick terminology: -- **Database Record** - a row in this table. In this case, each row in the table above corresponds to a single *person*. -- **Database Field** - the columns in this table. In our case, each column corresponds to a single measurement, such as `birth_datetime`. Each column has a specific datatype, which may be integers, decimals, dates, a short text field, or longer text fields. Think of them like the different pieces of information requested in a form. +- **Database Record** - a row in this table. In this case, each row in the table above corresponds to a single *person*. +- **Database Field** - the columns in this table. In our case, each column corresponds to a single measurement, such as `birth_datetime`. Each column has a specific datatype, which may be integers, decimals, dates, a short text field, or longer text fields. 
Think of them like the different pieces of information requested in a form. -It is faster and requires less memory if we do not use a single large table, but decompose the data up into *multiple tables*. These tables are stored in a number of different formats: +It is faster and requires less memory if we do not use a single large table, but decompose the data up into *multiple tables*. These tables are stored in a number of different formats: -- Comma Separated Value (CSV) -- A Single File (SQL Server) -- a *virtual file* +- Comma Separated Value (CSV) +- A Single File (SQL Server) +- a *virtual file* -In a virtual file, the data acts like it is stored in a single file, but is actually many different files underneath that can be on your machine, on the network, or on the cloud. The *virtual file* lets us interact with this large mass of data as if it is a single file. +In a virtual file, the data acts like it is stored in a single file, but is actually many different files underneath that can be on your machine, on the network, or on the cloud. The *virtual file* lets us interact with this large mass of data as if it is a single file. The database engine is responsible for scanning the data, either row by row, or column by column. The engines are made to be very fast in this scanning to return relevant records. -:::{.callout} +::: callout ## Rows versus Columns Just a quick note about row-based storage vs column-based storage. SQL was originally written for relational databases, which are stored by row. diff --git a/first-section-new-chapter.qmd b/first-section-new-chapter.qmd deleted file mode 100644 index 48f8218..0000000 --- a/first-section-new-chapter.qmd +++ /dev/null @@ -1,224 +0,0 @@ -# New Chapter - -## Learning Objectives - -Every chapter also needs Learning objectives. 
- -## Libraries - -For this chapter, we'll need the following packages attached: - -*Remember to add [any additional packages you need to your course's own docker image](https://github.com/jhudsl/OTTR_Template/wiki/Using-Docker#starting-a-new-docker-image). - -```{r} -library(magrittr) -``` - -## Topic of Section - -You can write all your text in sections like this, using `##` to indicate a new header. you can use additional pound symbols to create lower levels of headers. - -See [here](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) for additional general information about how you can format text within R Markdown files. In addition, see [here](https://pandoc.org/MANUAL.html#pandocs-markdown) for more in depth and advanced options. - -### Subtopic - -Here's a subheading (using three pound symbols) and some text in this subsection! - -## Code examples - -You can demonstrate code like this: - -```{r} -output_dir <- file.path("resources", "code_output") -if (!dir.exists(output_dir)) { - dir.create(output_dir) -} -``` - -And make plots too: - -```{r} -hist_plot <- hist(iris$Sepal.Length) -``` - -You can also save these plots to file: - -```{r} -png(file.path(output_dir, "test_plot.png")) -hist_plot -dev.off() -``` - - -## Image example - -How to include a Google slide. It's simplest to use the `ottrpal` package: - -```{r} -#| fig-align: "center" -#| fig-alt: "Major point!! example image" -#| echo: false -#| out-width: "100%" -ottrpal::include_slide("https://docs.google.com/presentation/d/1Dw_rBb1hySN_76xh9-x5J2dWF_das9BAUjQigf2fN-E/edit#slide=id.g252f18e2576_1_0") -``` - -But if you have the slide or some other image locally downloaded you can also use HTML like this: - -Major point!! example image - - -## Video examples - -You may also want to embed videos in your course. 
If alternatively, you just want to include a link you can do so like this: - -Check out this [link to a video](https://www.youtube.com/embed/VOCYL-FNbr0) using markdown syntax. - -### Using `knitr` - -To embed videos in your course, you can use `knitr::include_url()` like this: -Note that you should use `echo=FALSE` in the code chunk because we don't want the code part of this to show up. If you are unfamiliar with [how R Markdown code chunks work, read this](https://rmarkdown.rstudio.com/lesson-3.html). - - -```{r} -#| echo: false -knitr::include_url("https://www.youtube.com/embed/VOCYL-FNbr0") -``` - -### Using HTML - - - -### Using `knitr` - -```{r, fig.align="center", echo=FALSE, out.width="100%"} -knitr::include_url("https://drive.google.com/file/d/1mm72K4V7fqpgAfWkr6b7HTZrc3f-T6AV/preview") -``` - -### Using HTML - - - -## Website Examples - -Yet again you can use a link to a website like so: - -[A Website](https://yihui.org) - -You might want to have users open a website in a new tab by default, especially if they need to reference both the course and a resource at once. - -[A Website](https://yihui.org){target="_blank"} - -Or, you can embed some websites. - -### Using `knitr` - -This works: - -```{r, fig.align="center", echo=FALSE} -knitr::include_url("https://yihui.org") -``` - - -### Using HTML - - - - -## Stylized boxes - -Occasionally, you might find it useful to emphasize a particular piece of information. To help you do so, we have provided css code and images (no need for you to worry about that!) to create the following stylized boxes. - -You can use these boxes in your course with either of two options: using HTML code or Pandoc syntax. - -### Using `rmarkdown` container syntax - -The `rmarkdown` package allows for a different syntax to be converted to the HTML that you just saw and also allows for conversion to LaTeX. See the [Bookdown](https://bookdown.org/yihui/rmarkdown-cookbook/custom-blocks.html) documentation for more information. 
Note that Bookdown uses Pandoc. - - -``` -::: {.notice} -Note using rmarkdown syntax. - -::: -``` - -::: {.notice} -Note using rmarkdown syntax. - -::: - -As an example you might do something like this: - -::: {.notice} -Please click on the subsection headers in the left hand -navigation bar (e.g., 2.1, 4.3) a second time to expand the -table of contents and enable the `scroll_highlight` feature -([see more](introduction.html#scroll-highlight)) -::: - - -### Using HTML - -To add a warning box like the following use: - -
-Followed by the text you want inside -
- -This will create the following: - -
- -Followed by the text you want inside - -
- -Here is a `
` box: - -
- -Note text - -
- -Here is a `
` box: - -
- -GitHub text - -
- - -Here is a `
` box: - -
- -dictionary text - -
- - -Here is a `
` box: - -
- -reflection text - -
- - -Here is a `
` box: - -
- -Work in Progress text - -
- - -## Dropdown summaries - -
You can hide additional information in a dropdown menu -Here's more words that are hidden. -
diff --git a/img/omop1.png b/img/omop1.png new file mode 100644 index 0000000..3664a3e Binary files /dev/null and b/img/omop1.png differ diff --git a/img/omop2.png b/img/omop2.png new file mode 100644 index 0000000..07632b2 Binary files /dev/null and b/img/omop2.png differ diff --git a/index.qmd b/index.qmd index 44e5829..05c0eeb 100644 --- a/index.qmd +++ b/index.qmd @@ -6,58 +6,56 @@ Data that we need to utilize and query is often stored in data sources such as d ## Learning Objectives -- **Explain** data sources such as Databases and how to connect to them -- **Query** data sources using database engines and Structured Query Language (SQL) to **filter**, **join**, and **aggregate** data -- **Construct** and **calculate** new fields using `SELECT` or `CASE WHEN` -- (optional) **Read** and **explain** a sample OMOP query: +- **Explain** data sources such as Databases and how to connect to them +- **Query** data sources using database engines and Structured Query Language (SQL) to **filter**, **join**, and **aggregate** data +- **Construct** and **calculate** new fields using `SELECT` or `CASE WHEN` +- (optional) **Read** and **explain** a sample OMOP query: ## Instructors -If you need to schedule some time to talk, please schedule with Ted. +If you need to schedule some time to talk, please schedule with Chris. 
-- [Ted Laderas](https://laderast.github.io), Director of Training and Community, Office of the Chief Data Officer -- [Vivek Sriram](https://viveksriram.com/), Data Scientist, Office of the Chief Data Officer +- Chris Lo, Data Scientist/Trainer, Office of the Chief Data Officer +- [Vivek Sriram](https://viveksriram.com/), Data Scientist, Office of the Chief Data Officer ## Introductions In chat, please introduce yourself: -- Your Name & Your Group -- What you want to learn in this course -- Favorite Winter activity - +- Your Name & Your Group +- What you want to learn in this course +- Favorite Fall activity ## Tentative Schedule -All classes are on Fridays from 12:00-1:30 PM PST. Connection details will be provided. Office hours related to each class day are posted below, and the invite will be sent to you. +All classes are on Fridays from 12:00-1:30 PM PST. Connection details will be provided. Office hours related to each class day are posted below, and the invite will be sent to you. In class we will be going through the Quarto Notebooks that are hosted on Posit.cloud. No knowledge of R is necessary, we'll show you what you need to know in class. Classes will be recorded, and those recordings will be sent to you after each class. 
- -| Week | Date | Subject |Office Hours| -| ---- | ------ | ---------------------------------------------------------- |------------| -|Pre-class|----|[Concepts of Databases](concepts.html)| -| 1 | May 8 | [Intro to SQL; `SHOW TABLES`, `DESCRIBE`, `SELECT`, `WHERE`](week1.html) |TBD| -| 2 | May 15 | [`JOIN`ing tables, more `WHERE`](week2.html) |TBD| -hours| -| 3 | May 22 | [Calculating new fields, `GROUP BY`, `CASE WHEN`, `HAVING`](week3.html) |TBD| -| 4 | May 29 | [Subqueries/Views, Recap of course / review OMOP queries](week4.html) |TBD| +| Week | Date | Subject | +|----------|----------|-------------------------------------------| +| Pre-class | ---- | [Concepts of Databases](concepts.html) | +| 1 | Oct 10 | [Intro to SQL; `SHOW TABLES`, `DESCRIBE`, `SELECT`, `WHERE`](week1.html) | +| 2 | Oct 17 | [`JOIN`ing tables, more `WHERE`](week2.html) | +| No class! | Oct 24 | | +| 3 | Oct 31 | [Calculating new fields, `GROUP BY`, `CASE WHEN`, `HAVING`](week3.html) | +| 4 | Nov 7 | [Subqueries/Views, Recap of course / review OMOP queries](week4.html) | ## Format of Class -I will teach online only, though you have the option of attending in the DaSL Lounge (Arnold M1-B406), which will have snacks and drinks available. Either Chris Lo or Vivek Sriram will host in person. +The course will be taught in a hybrid format. Come to M1-B406 to learn in person and enjoy the snacks, or join online via the Teams meeting on your calendar. -We will spend the first 20-25 minutes of each class on catching up on last week's exercises if you haven't had the opportunity to work on them. Followed by that, we will have a short lecture/lab, where we will go through the notebooks for the week. +We will spend the first 20-25 minutes of each class catching up on last week's exercises if you haven't had the opportunity to work on them. After that, we will have a short lecture/lab, where we will go through the notebooks for the week. 
## First Class Survey -[First Class Survey](https://docs.google.com/forms/d/e/1FAIpQLSdQnKvZuj_7LVd-Nqm3TQIoJ3hGPPq2WSUmgUltkvPvirCrTQ/viewform?usp=dialog) - Please fill out. We mostly want to see how confident you are before and after class. We will share these results with everyone (anonymized). +[First Class Survey](https://forms.gle/smj4wFAQufoHsG6h7) - Please fill out. We mostly want to see how confident you are before and after class. We will share these results with everyone (anonymized). ## Weekly Check In -[Weekly Check In Form](https://docs.google.com/forms/d/e/1FAIpQLSdx2WevmnwP1S2d9zhO_joHjdVbMkylVvEPjhd1WxLIbUaf8w/viewform?usp=sharing) - please fill out to let us know if you have any issues or want to share what you've learned. We look at the answers in aggregate and we anonymize responses (unless you want us to know). +[Weekly Check In Form](https://forms.gle/obwC5GYAA3iPHk5x7) - please fill out to let us know if you have any issues or want to share what you've learned. We look at the answers in aggregate and we anonymize responses (unless you want us to know). ## Posit Cloud Intro @@ -70,12 +68,12 @@ Here is a short video introducing you to the Posit Cloud interface. - Learning on the job is challenging - I will move at learner's pace; we are learning together. - Teach not for mastery, but teach for empowerment to learn effectively. - + We sometimes struggle with our data science in isolation, unaware that someone two doors down from us has gone through the same struggle. 
- - *We learn and work better with our peers.* - - *Know that if you have a question, other people will have it.* - - *Asking questions is our way of taking care of others.* +- *We learn and work better with our peers.* +- *Know that if you have a question, other people will have it.* +- *Asking questions is our way of taking care of others.* We ask you to follow [Participation Guidelines](https://hutchdatascience.org/communitystudios/guidelines/) and [Code of Conduct](https://github.com/fhdsl/coc). @@ -97,18 +95,14 @@ What it isn't: - Accreditation through an university or degree-granting program. - Requirements: -- Complete badge-required sections of the exercises for 3 out of 4 assignments. We'll cover this in class - - +- Complete the exercises for 3 out of 4 assignments. ## Available Course Formats -This course is available in multiple formats which allows you to take it in the way that best suites your needs. - -- The material for this course can be viewed without login requirement on this [website](https:///intro-sql-fh.netlify.app/). This format might be most appropriate for you if you rely on screen-reader technology. -- The material is also available to Fred Hutch Consortia students via Posit Cloud. -- Our courses are open source, you can find the [source material for this course on GitHub](https://github.com/fhdsl/intro_to_sql). +This course is available in multiple formats, which allows you to take it in the way that best suits your needs. +- The material for this course can be viewed without login requirement on this [website](https:///intro-sql-fh.netlify.app/). This format might be most appropriate for you if you rely on screen-reader technology. +- The material is also available to Fred Hutch Consortia students via Posit Cloud. +- Our courses are open source; you can find the [source material for this course on GitHub](https://github.com/fhdsl/intro_to_sql). 
diff --git a/second-section-new-chapter.qmd b/second-section-new-chapter.qmd deleted file mode 100644 index d3b6401..0000000 --- a/second-section-new-chapter.qmd +++ /dev/null @@ -1,266 +0,0 @@ - -# A new chapter - -If you haven't yet read the getting started Wiki pages; [start there](https://www.ottrproject.org/getting_started.html). - -To see the rendered version of this chapter and the rest of the template, see here: https://jhudatascience.org/OTTR_Template/. - -Every chapter needs to start out with this chunk of code: - - -```{r, echo=FALSE, fig.alt='I can determine if a DMS plan is required using the table here: https://sharing.nih.gov/sites/default/files/List-of-Activity-Codes-Applicable-to-DMS-Policy.pdf', out.width = '100%', fig.align = 'center'} - -``` - - -```{r, include = FALSE} -ottrpal::set_knitr_image_path() -``` - -## Learning Objectives - -Every chapter also needs Learning objectives that will look like this: - -This chapter will cover: - -- {You can use https://tips.uark.edu/using-blooms-taxonomy/ to define some learning objectives here} -- {Another learning objective} - -## Libraries - -For this chapter, we'll need the following packages attached: - -*Remember to add [any additional packages you need to your course's own docker image](https://github.com/jhudsl/OTTR_Template/wiki/Using-Docker#starting-a-new-docker-image). - -```{r} -library(magrittr) -``` - -## Topic of Section - -You can write all your text in sections like this, using `##` to indicate a new header. you can use additional pound symbols to create lower levels of headers. - -See [here](https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf) for additional general information about how you can format text within R Markdown files. In addition, see [here](https://pandoc.org/MANUAL.html#pandocs-markdown) for more in depth and advanced options. - -### Subtopic - -Here's a subheading (using three pound symbols) and some text in this subsection! 
- -## Code examples - -You can demonstrate code like this: - -```{r} -output_dir <- file.path("resources", "code_output") -if (!dir.exists(output_dir)) { - dir.create(output_dir) -} -``` - -And make plots too: - -```{r} -hist_plot <- hist(iris$Sepal.Length) -``` - -You can also save these plots to file: - -```{r} -png(file.path(output_dir, "test_plot.png")) -hist_plot -dev.off() -``` - -## Image example - -How to include a Google slide. It's simplest to use the `ottrpal` package: - - -```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"} -ottrpal::include_slide("https://docs.google.com/presentation/d/1YmwKdIy9BeQ3EShgZhvtb3MgR8P6iDX4DfFD65W_gdQ/edit#slide=id.gcc4fbee202_0_141") -``` - -But if you have the slide or some other image locally downloaded you can also use HTML like this: - -Major point!! example image - -## Video examples -You may also want to embed videos in your course. If alternatively, you just want to include a link you can do so like this: - -Check out this [link to a video](https://www.youtube.com/embed/VOCYL-FNbr0) using markdown syntax. - -### Using `knitr` - -To embed videos in your course, you can use `knitr::include_url()` like this: -Note that you should use `echo=FALSE` in the code chunk because we don't want the code part of this to show up. If you are unfamiliar with [how R Markdown code chunks work, read this](https://rmarkdown.rstudio.com/lesson-3.html). 
- - -```{r, echo=FALSE} -knitr::include_url("https://www.youtube.com/embed/VOCYL-FNbr0") -``` - -### Using HTML - - - -### Using `knitr` - -```{r, fig.align="center", echo=FALSE, out.width="100%"} -knitr::include_url("https://drive.google.com/file/d/1mm72K4V7fqpgAfWkr6b7HTZrc3f-T6AV/preview") -``` - -### Using HTML - - - -## Website Examples - -Yet again you can use a link to a website like so: - -[A Website](https://yihui.org) - -You might want to have users open a website in a new tab by default, especially if they need to reference both the course and a resource at once. - -[A Website](https://yihui.org){target="_blank"} - -Or, you can embed some websites. - -### Using `knitr` - -This works: - -```{r, fig.align="center", echo=FALSE} -knitr::include_url("https://yihui.org") -``` - - -### Using HTML - - - - -If you'd like the URL to show up in a new tab you can do this: - -``` -LinkedIn -``` - -## Citation examples - -We can put citations at the end of a sentence like this [@rmarkdown2021]. -Or multiple citations [@rmarkdown2021, @Xie2018]. - -but they need a ; separator [@rmarkdown2021; @Xie2018]. - -In text, we can put citations like this @rmarkdown2021. - -## Stylized boxes - -Occasionally, you might find it useful to emphasize a particular piece of information. To help you do so, we have provided css code and images (no need for you to worry about that!) to create the following stylized boxes. - -You can use these boxes in your course with either of two options: using HTML code or Pandoc syntax. - -### Using `rmarkdown` container syntax - -The `rmarkdown` package allows for a different syntax to be converted to the HTML that you just saw and also allows for conversion to LaTeX. See the [Bookdown](https://bookdown.org/yihui/rmarkdown-cookbook/custom-blocks.html) documentation for more information [@Xie2020]. Note that Bookdown uses Pandoc. - - -``` -::: {.notice} -Note using rmarkdown syntax. - -::: -``` - -::: {.notice} -Note using rmarkdown syntax. 
- -::: - -As an example you might do something like this: - -::: {.notice} -Please click on the subsection headers in the left hand -navigation bar (e.g., 2.1, 4.3) a second time to expand the -table of contents and enable the `scroll_highlight` feature -([see more](introduction.html#scroll-highlight)) -::: - - -### Using HTML - -To add a warning box like the following use: - -``` -
-Followed by the text you want inside -
-``` - -This will create the following: - -
- -Followed by the text you want inside - -
- -Here is a `
` box: - -
- -Note text - -
- -Here is a `
` box: - -
- -GitHub text - -
- - -Here is a `
` box: - -
- -dictionary text - -
- - -Here is a `
` box: - -
- -reflection text - -
- - -Here is a `
` box: - -
- -Work in Progress text - -
- - -## Dropdown summaries - -
You can hide additional information in a dropdown menu -Here's more words that are hidden. -
- -## Print out session info - -You should print out session info when you have code for [reproducibility purposes](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/managing-package-versions.html). - -```{r} -devtools::session_info() -``` - -[many links]: https://github.com/jhudsl/OTTR_Template diff --git a/week1-exercises-continued.qmd b/week1-exercises-continued.qmd new file mode 100644 index 0000000..fb943c8 --- /dev/null +++ b/week1-exercises-continued.qmd @@ -0,0 +1,93 @@ +--- +title: "Week 1 Exercise Continued" +format: html +editor: visual +--- + +## Week 1 Exercise Continued + +We'll first connect to the database: + +```{r} +#| context: setup +library(duckdb) +library(DBI) +con <- DBI::dbConnect(duckdb::duckdb(), + "data/synthea-smaller_breast_cancer.db") + +``` + +Here is an **entity-relationship diagram:** + +![](img/omop1.png) + +Each rectangle represents a table, and within each table are the columns (fields). I am only showing a subset of the columns based on what we have explored so far in class. The connecting lines show that there are shared values between tables in those columns, which helps one navigate between tables: + +- In the "person" table, the elements of the column `person_id` overlap with the elements of the `person_id` column in the table "procedure_occurrence". + +- In the "procedure_occurrence" table, the elements of the column `procedure_concept_id` overlap with the elements of the `concept_id` column in the table "concept". + +We should consider to what degree the values overlap: + +- For each `person_id` in the "person" table, there may be duplicated `person_id`s in the "procedure_occurrence" table, as a patient can have multiple procedures. This is a **one-to-many relationship**. + +- Multiple elements of `procedure_concept_id` in the "procedure_occurrence" table may correspond to a single element of `concept_id` in the "concept" table. This is a **many-to-one relationship**. 
+ +- You can also have a **one-to-one relationship**. + +In class today you will start joining these tables via the columns that have shared elements! However, before we go wild with joining, it is often good to explore the relationship so that we know what to expect in our join. + +Let's explore the `person_id` column that is found in both the "person" and "procedure_occurrence" tables. + +First, look at the number of elements in `person_id` from the "person" table: + +```{sql} +#| connection: "con" +SELECT COUNT(-----) + FROM person +``` + +How about distinct elements in `person_id` from the "person" table? + +```{sql} +#| connection: "con" +SELECT COUNT(DISTINCT ----) + FROM person +``` + +Okay, let's look at the "procedure_occurrence" table: what is the number of elements in `person_id` from the "procedure_occurrence" table? + +```{sql connection="con"} + +``` + +How about distinct elements in `person_id` from the "procedure_occurrence" table? + +```{sql connection="con"} + + +``` + +Let's look at an example: query for the columns `procedure_occurrence_id`, `person_id`, and `procedure_concept_id` in the "procedure_occurrence" table, *where* the `person_id` has a value of 4. + +```{sql connection="con"} +SELECT ----- + FROM ------ + WHERE ----- +``` + +What can you say about the relationship of these two tables based on what you explored above? One-to-one, one-to-many, or many-to-one? + +If there is time, do the same analysis for the "procedure_occurrence" and "concept" tables using the columns `procedure_concept_id` and `concept_id`, respectively. 
+ +```{sql connection="con"} + +``` + +```{sql connection="con"} + +``` + +```{sql connection="con"} + +``` diff --git a/week1.qmd b/week1.qmd index 23bca8a..cd9dca6 100644 --- a/week1.qmd +++ b/week1.qmd @@ -5,11 +5,10 @@ format: html ## Our Composable Database System -- Client: R/RStudio w/ SQL -- Database Engine: DuckDB -- Data Storage: single file in `data/` folder +- Client: R/RStudio w/ SQL +- Database Engine: DuckDB +- Data Storage: single file in `data/` folder - ## Connecting to our database To access the data, we need to create a database connection. We use `dbConnect()` from the `DBI` package to do this. The first argument specifies the Database engine (`duckdb()`), and the second provides the file location: `"data/data/GiBleed_5.3_1.1.duckdb"`. @@ -25,16 +24,16 @@ con <- DBI::dbConnect(duckdb::duckdb(), Once open, we can use `con` (our database connection) -:::{.callout-note} +::: callout-note ## Keep in Mind: SQL ignores letter case -These are the same to the database engine: +These are the same to the database engine: -``` +``` SELECT person_id FROM person; ``` -``` +``` select PERSON_ID FROM person; ``` @@ -57,21 +56,20 @@ We can get further information about the tables within our database using `DESCR DESCRIBE; ``` - We'll look at a few tables in our work: - - `person` - Contains personal & demographic data - - `procedure_occurrence` - procedures performed on patients and when they happened - - `condition_occurrence` - patient conditions (such as illnesses) and when they occurred - - `concept` - contains the specific information (names of concepts) that map into all three above tables - - We'll talk much more later about the relationships between these tables. 
+- `person` - Contains personal & demographic data +- `procedure_occurrence` - procedures performed on patients and when they happened +- `condition_occurrence` - patient conditions (such as illnesses) and when they occurred +- `concept` - contains the specific information (names of concepts) that map into all three above tables + +We'll talk much more later about the relationships between these tables. ## `SELECT` and `FROM` If we want to see the contents of a table, we can use `SELECT` and `FROM`. -``` +``` SELECT * # select all columns FROM person # from the person table LIMIT 10; # return only 10 rows @@ -82,11 +80,11 @@ SELECT * # select all columns SELECT * FROM person LIMIT 10; ``` -1. Why are there `birth_datetime` and the `month_of_birth`, `day_of_birth`, `year_of_birth` - aren't these redundant? +1. Why are there `birth_datetime` and the `month_of_birth`, `day_of_birth`, `year_of_birth` - aren't these redundant? ## Try it Out -Look at the first few rows of `procedure_occurrence`. +Look at the first few rows of `procedure_occurrence`. ```{sql} #| eval: FALSE @@ -94,13 +92,13 @@ Look at the first few rows of `procedure_occurrence`. SELECT * FROM ____ LIMIT 10; ``` -1. Why is there a `person_id` column in this table as well? +1. Why is there a `person_id` column in this table as well? ## `SELECT`ing a few columns in our table -We can use the `SELECT` clause to grab specific columns in our data. +We can use the `SELECT` clause to grab specific columns in our data. -``` +``` SELECT person_id, birth_datetime, gender_concept_id # Columns in our table FROM person; # Our Table ``` @@ -134,7 +132,6 @@ SELECT person_id, birth_datetime, gender_concept_id, ____, ____ FROM person; ``` - ## `WHERE` - filtering our table Adding `WHERE` to our SQL statement lets us add filtering to our query: @@ -171,12 +168,12 @@ SELECT person_id, gender_source_value, race_source_value Reminder: use single ('') quotes in your SQL statements to refer to values, not double quotes ("). 
-:::{.callout-note} +::: callout-note ### Quick Note For R users, notice the similarity of `select()` with `SELECT`. We can rewrite the above in `dplyr` code as: -```r +``` r person |> select(person_id, gender_source_value, race_source_value) ``` @@ -186,7 +183,7 @@ A lot of `dplyr` was inspired by SQL. In fact, there is a package called `dbplyr ## `COUNT` - how many rows? -Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for. +Sometimes you want to know the *size* of your result, not necessarily return the entire set of results. That is what `COUNT` is for. ```{sql} #| connection: "con" @@ -245,7 +242,7 @@ SELECT COUNT(DISTINCT --------------) One of the important properties of data in a relational database is that there are no *repeat rows* in the database. Each table that meets this restriction has what is called a *primary key*. -We can use `DESCRIBE` to get more information (the metadata) about a table. This gives us information about our tables. +We can use `DESCRIBE` to get more information (the metadata) about a table. This gives us information about our tables. ```{sql} #| connection: "con" @@ -269,19 +266,19 @@ What column is the same in both tables? That is a hint for what we'll cover next If you look at the `column_type` for one of the `DESCRIBE` statements above, you'll notice there are different data types: -- `INTEGER` -- `TIMESTAMP` -- `DATE` -- `VARCHAR` +- `INTEGER` +- `TIMESTAMP` +- `DATE` +- `VARCHAR` -Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on them. For example, we can do things like `date arithmetic` on `DATETIME` columns, asking the engine to calculate 5 days after the dates. +Each column of a database needs to be *typed*. The *data type* of a column determines what kinds of calculations or operations we can do on them. 
For example, we can do things like `date arithmetic` on `DATETIME` columns, asking the engine to calculate 5 days after the dates. You can see all of the [datatypes that are available in DuckDB here](https://duckdb.org/docs/sql/data_types/overview.html). ## Always close the connection -When we're done, it's best to close the connection with `dbDisconnect()`. +When we're done, it's best to close the connection with `dbDisconnect()`. ```{r} dbDisconnect(con) -``` \ No newline at end of file +``` diff --git a/week2-exercises.qmd b/week2-exercises.qmd index cea3d95..ee0a086 100644 --- a/week2-exercises.qmd +++ b/week2-exercises.qmd @@ -10,7 +10,6 @@ We'll first connect to the database: #| context: setup library(duckdb) library(DBI) -library(DiagrammeR) con <- DBI::dbConnect(duckdb::duckdb(), "data/synthea-smaller_breast_cancer.db") @@ -38,7 +37,13 @@ SELECT person_id, gender_source_value, birth_datetime ## `JOIN`s -`INNER JOIN` `person` and `concept` on `gender_concept_id` and `concept_id`. `SELECT` `person_id` and `concept_name` from the appropriate tables. +`INNER JOIN` `person` and `concept` on `gender_concept_id` and `concept_id`, respectively. `SELECT` `person_id` and `concept_name` from the appropriate tables. + +We update our entity-relationship diagram as follows to show the link between `gender_concept_id` in `person` and `concept_id` in `concept`: + +![](img/omop2.png) + +This shows that rows of the `concept` table contains `concept_id`s for information about `gender_concept_id` from `person` *and* `procedure_concept_id` from `procedure_occurrence`! The `concept` table is like a lookup table of anything that has a `concept_id`. 
```{sql} #| connection: "con" @@ -49,15 +54,7 @@ SELECT * LIMIT 20; ``` -```{sql} -#| connection: "con" -SELECT person.person_id, concept.concept_name - FROM person - INNER JOIN concept - ON person.gender_concept_id = concept.concept_id -``` - -`INNER JOIN` 3 tables: `procedure_occurrence`, `person`, and `concept`, `ON` the appropriate keys. Select `person_id`, `birth_datetime`, `concept_name`, and `procedure_date` from the appropriate tables. Use table references and aliases to make the column names unambiguous. +`INNER JOIN` 3 tables: `person`, `procedure_occurrence`, and `concept`, `ON` the appropriate keys. (See class notes for similar examples.) Select `person_id`, `birth_datetime`, `concept_name`, and `procedure_date` from the appropriate tables. Use table references and aliases to make the column names unambiguous. ```{sql} #| connection: "con" @@ -70,18 +67,7 @@ SELECT -----, ------, -----, ----- ``` -```{sql} -#| connection: "con" -SELECT p.person_id, p.birth_datetime, c.concept_name, po.procedure_date - FROM person AS p - INNER JOIN procedure_occurrence AS po - ON p.person_id = po.person_id - INNER JOIN concept AS c - ON po.procedure_concept_id = c.concept_id - -``` - -Modify the above query to select only those procedures done after the yeqr 2000. +Modify the above query to select only those procedures done after the year 2000 in the code chunk below. 
You can extract the Year part from a date column with `date_part('YEAR', ------)` @@ -96,22 +82,11 @@ SELECT -----, ------, -----, ----- WHERE ----- - ------- ``` -```{sql} -#| connection: "con" -SELECT p.person_id, p.birth_datetime, c.concept_name, po.procedure_date, po.procedure_concept_id - FROM person AS p - INNER JOIN procedure_occurrence AS po - ON p.person_id = po.person_id - INNER JOIN concept AS c - ON po.procedure_concept_id = c.concept_id - WHERE date_part('YEAR',po.procedure_date) > 2000 -``` - ## Boolean Logic Count the number of cases for `procedure_occurrence` with the following criteria: -``` +``` procedure_concept_id = 4230911 AND date_part('YEAR', procedure_datetime) > 2000 ``` @@ -123,6 +98,7 @@ SELECT COUNT(*) ---------- ---------- ``` + Try it out with `OR` instead. Was your result bigger or smaller than the `AND`? ```{sql} @@ -135,7 +111,7 @@ SELECT COUNT(*) ## On Your Own -Try constructing a query of your own that uses a `JOIN`. If you want to go further, add a `WHERE` as well. +Try constructing a query of your own that uses a `JOIN`. If you want to go further, add a `WHERE` as well. Consider other tables in the database! ```{sql} #| connection: "con" diff --git a/week2.qmd b/week2.qmd index f998f76..e882fab 100644 --- a/week2.qmd +++ b/week2.qmd @@ -15,7 +15,6 @@ con <- DBI::dbConnect(duckdb::duckdb(), "data/GiBleed_5.3_1.1.duckdb") ``` - ## Table References In single table queries, it is usually unambiguous to the query engine which column and which table you need to query. @@ -24,13 +23,13 @@ However, when you involve multiple tables, it is important to know how to refer For example, the `procedure_occurrence` table has a `person_id` column as well. 
If we want to use this specific column in this table, we can use the `.` (dot) notation: -``` +``` procedure_occurrence.person_id ``` If we wanted the `person_id` column in `person` we can use this: -``` +``` person.person_id ``` @@ -79,106 +78,125 @@ SELECT COUNT(person_id) AS person_count WHERE year_of_birth < 2000; ``` -Now that we are going to use `JOIN`s, we will be using aliases and table references a lot. +We will be using aliases and table references a lot when we start `JOIN`ing tables. + +## Entity-relationship diagrams + +Joining tables requires understanding the relationships between tables in a database. This is often visualized via an **entity-relationship diagram:** + +![](img/omop1.png) + +Each rectangle represents a table, and within each table are the columns (fields). I am only showing a subset of the columns based on what we have explored so far in class. The connecting lines show that there are shared values between tables in those columns, which helps one navigate between tables: + +- In the `person` table, the elements of the column `person_id` overlap with the elements of the `person_id` column in the table `procedure_occurrence`. + +- In the `procedure_occurrence` table, the elements of the column `procedure_concept_id` overlap with the elements of the `concept_id` column in the table `concept`. + +We should consider to what degree the values overlap: + +- For each `person_id` in the `person` table, there may be duplicated `person_id`s in the `procedure_occurrence` table, as a patient can have multiple procedures. This is a **one-to-many relationship**. + +- Multiple elements of `procedure_concept_id` in the `procedure_occurrence` table may correspond to a single element of `concept_id` in the `concept` table. This is a **many-to-one relationship**. + +- You can also have a **one-to-one relationship**. 
+ +The database we've been using has been rigorously modeled using a data model called [OMOP CDM (Common Data Model)](https://ohdsi.github.io/CommonDataModel/index.html). OMOP is short for Observational Medical Outcomes Partnership, and it is designed to be a database format that standardizes data from different systems so that it can be combined with data from other systems to compare health outcomes across organizations. The full OMOP entity relationship diagram can be [found here](https://ohdsi.github.io/CommonDataModel/cdm54erd.html). + +Now, let's join some tables. ## `JOIN` -We use the `JOIN` clause when we want to combine information from two tables. Here we are going to combine information from two tables: `procedure_occurrence` and `concept`. +We use the `JOIN` clause when we want to combine information from two tables. Here we are going to combine information from two tables: `person` and `procedure_occurrence`. -To set the stage, let's show two tables, `x` and `y`. We want to join them by the keys, which are represented by colored boxes in both of the tables. +To set the stage, let's show two tables, `x` and `y`. We want to join them by the keys, which are represented by colored boxes in both of the tables. -Note that table `x` has a key ("3") that isn't in table `y`, and that table `y` has a key ("4") that isn't in table `x`. +Note that table `x` has a key ("3") that isn't in table `y`, and that table `y` has a key ("4") that isn't in table `x`. ![](img/original-dfs.png) -We are going to explore `INNER JOIN` first. In an `INNER JOIN`, we match up our primary key for our table on the foreign key for another table. In this case, we only retain rows that have keys that exist in both the `x` and `y` tables. We drop all rows that don't have matches in both tables. +We are going to explore `INNER JOIN` first. In an `INNER JOIN`, we pick one column from each table and match up the elements of those two columns. 
In this case, we only retain rows that have elements that exist in both the `x` and `y` tables. We drop all rows that don't have matches in both tables. -![](img/inner-join.gif) -There are other types of joins when we want to retain information from the `x` table or the `y` table, or both. +![](img/inner-join.gif)There are other types of joins when we want to retain information from the `x` table or the `y` table, or both. ## `INNER JOIN` syntax -Here's an example where we are joining `procedure_occurrence` with `concept`: +Here's an example where we are joining `person` with `procedure_occurrence`: ```{sql} #| connection: "con" -SELECT procedure_occurrence.person_id, concept.concept_name - FROM procedure_occurrence - INNER JOIN concept - ON procedure_occurrence.procedure_concept_id = concept.concept_id +SELECT person.person_id, procedure_occurrence.procedure_occurrence_id + FROM person + INNER JOIN procedure_occurrence + ON person.person_id = procedure_occurrence.person_id ``` What's going on here? The magic happens with this clause, which we use to specify the two tables we need to join. -``` -FROM procedure_occurrence - INNER JOIN concept +``` +FROM person + INNER JOIN procedure_occurrence ``` -The last thing to note is the `ON` statement. These are the conditions by which we merge rows. Note we are taking one column in `procedure.occurrence`, the `procedure_concept_id`, and matching the rows up with those rows in `concept` +The last thing to note is the `ON` statement. These are the conditions by which we merge rows. 
We are taking one column in `person`, the `person_id`, and matching the rows up with those rows in `procedure_occurrence`'s own `person_id` column: -``` -ON procedure_occurrence.procedure_concept_id = concept.concept_id +``` +ON person.person_id = procedure_occurrence.person_id ``` -```{sql} -#| connection: "con" -SELECT procedure_occurrence.person_id, concept.concept_name - FROM procedure_occurrence - INNER JOIN concept - ON procedure_occurrence.procedure_concept_id = concept.concept_id -``` - -Here is the same query using aliases. We use `po` as an alias for `procedure_occurrence` and `c` as an alias for `concept`. You can see it is a little more compact. +Here is the same query using aliases. We use `p` as an alias for `person` and `po` as an alias for `procedure_occurrence`. You can see it is a little more compact. ```{sql} #| connection: "con" -#| output.var: pro -SELECT po.person_id, c.concept_name - FROM procedure_occurrence as po - INNER JOIN concept as c - ON po.procedure_concept_id = c.concept_id; +SELECT p.person_id, po.procedure_occurrence_id + FROM person as p + INNER JOIN procedure_occurrence as po + ON p.person_id = po.person_id ``` ## `LEFT JOIN` -::: {.callout-note} ## Jargon alert The table to the **left** of the `JOIN` clause is called the **left table**, and the table to the **right** of the `JOIN` clause is known as the **right table**. This will become more important as we explore the different join types. -``` +``` FROM procedure_occurrence INNER JOIN concept ^^Left Table ^^Right Table ``` -::: - -What if we want to retain all of the rows in the `procedure_occurrence` table, even if there are no matches in the `concept` table? We can use a `LEFT JOIN` to do that. +What if we want to retain all of the rows in the `procedure_occurrence` table, even if there are no matches in the `concept` table? We can use a `LEFT JOIN` to do that. 
![](img/left-join.gif) If a row exists in the left table, but not the right table, it will be replicated in the joined table, but have rows with `NULL` columns from the right table. -I tried to find some examples where `LEFT JOIN`ed tables were different than `INNER JOIN`ed tables, but couldn't find one good example in our tables. Here is another example: +Here is another example: + +![](img/Slide4.jpeg)We can see the difference between a `INNER JOIN` and `LEFT JOIN` by counting the number of rows kept after joining: -![](img/Slide4.jpeg) -Nevertheless, here is an example of a `LEFT JOIN`: +```{sql} +#| connection: "con" +SELECT COUNT (*) + FROM person as p + INNER JOIN procedure_occurrence as po + ON p.person_id = po.person_id +``` ```{sql} #| connection: "con" -SELECT c.concept_name, po.person_id, c.domain_id - FROM concept as c - LEFT JOIN procedure_occurrence AS po - ON po.procedure_concept_id = c.concept_id - WHERE c.domain_id = 'Procedure' +SELECT COUNT (*) + FROM person as p + LEFT JOIN procedure_occurrence as po + ON p.person_id = po.person_id ``` +This suggests that there are some unique `person_id`s in `person` table not found in the `person_id` of `procedure_occurrence` table. + ## Other kinds of `JOIN`s -- The `RIGHT JOIN` is identical to `LEFT JOIN`, except that the rows preserved are from the *right* table. -- The `FULL JOIN` retains all rows in both tables, regardless if there is a key match. -- `ANTI JOIN` is helpful to find all of the keys that are in the *left* table, but not the *right* table +- The `RIGHT JOIN` is identical to `LEFT JOIN`, except that the rows preserved are from the *right* table. +- The `FULL JOIN` retains all rows in both tables, regardless if there is a key match. 
+- `ANTI JOIN` is helpful to find all of the keys that are in the *left* table, but not the *right* table ## Multiple `JOIN`s with Multiple Tables @@ -186,7 +204,7 @@ We can have multiple joins by thinking them as a sequential operation of one joi ```{sql} #| connection: "con" -SELECT p.gender_source_value, c.concept_name, po.procedure_date +SELECT p.person_id, po.procedure_occurrence_id, c.concept_name FROM person AS p INNER JOIN procedure_occurrence AS po ON p.person_id = po.person_id @@ -197,58 +215,60 @@ SELECT p.gender_source_value, c.concept_name, po.procedure_date The way I think of these multi-table joins is to decompose them into two joins: -1. We first `INNER JOIN` `person` and `procedure_occurrence`, to produce an output table -2. We take this output table and `INNER JOIN` it with `concept`. +1. We first `INNER JOIN` `person` and `procedure_occurrence`, to produce an output table +2. We take this output table and `INNER JOIN` it with `concept`. Notice that both of these `JOIN`s have separate `ON` statements. For the first join, we have: -``` +``` INNER JOIN procedure_occurrence AS po ON p.person_id = po.person_id ``` For the second `JOIN`, we have: -``` +``` INNER JOIN concept AS c ON po.procedure_concept_id = c.concept_id ``` And that gives us the final table, which takes variables from all three tables. -One thing to keep in mind is that `JOIN`s are not necessarily commutative; that is, the order of joins can matter. This is because we may drop or preserve rows depending on the `JOIN`. +One thing to keep in mind is that `JOIN`s are not necessarily commutative; that is, the order of joins can matter. This is because we may drop or preserve rows depending on the `JOIN`. -For combining `INNER JOIN`s, we are looking for the subset of keys that exist in each table, so join order doesn't matter. But for combining `LEFT JOIN`s and `RIGHT JOINS`, order *can* matter. 
+For combining `INNER JOIN`s, we are looking for the subset of keys that exist in each table, so join order doesn't matter. But for combining `LEFT JOIN`s and `RIGHT JOIN`s, order *can* matter. It's really important to check intermediate output and make sure that you are retaining the rows that you need in the final output. For example, I'd try the first join first and see that it contains the rows that I need before adding the second join. ## Using `JOIN` with `WHERE` -Where we really start to cook with gas is when we combine `JOIN` with `WHERE`. Here, we're joining `procedure_occurrence` and `concept`, with an additional `WHERE` where we only want those rows that have the `concept_name` of 'Subcutaneous immunotherapy`: +Where we really start to cook with gas is when we combine `JOIN` with `WHERE`. Let's add a `WHERE` clause so that we keep only those rows that have the `concept_name` of 'Subcutaneous immunotherapy': ```{sql} #| connection: "con" -SELECT po.person_id, c.concept_name - FROM procedure_occurrence as po - INNER JOIN concept as c - ON po.procedure_concept_id = c.concept_id - WHERE c.concept_name = 'Subcutaneous immunotherapy'; +SELECT p.person_id, po.procedure_occurrence_id, c.concept_name + FROM person AS p + INNER JOIN procedure_occurrence AS po + ON p.person_id = po.person_id + INNER JOIN concept AS c + ON po.procedure_concept_id = c.concept_id + WHERE c.concept_name = 'Subcutaneous immunotherapy'; + ``` -Here is a triple join query with an additional filter. 
You can see why aliases are useful: +Or keeping rows where the year of birth is before 1980: ```{sql} #| connection: "con" -SELECT po.person_id, c.concept_name, p.birth_datetime - FROM procedure_occurrence as po - INNER JOIN concept as c - ON po.procedure_concept_id = c.concept_id - INNER JOIN person as p - ON po.person_id = p.person_id +SELECT p.person_id, p.year_of_birth, po.procedure_occurrence_id, c.concept_name + FROM person AS p + INNER JOIN procedure_occurrence AS po + ON p.person_id = po.person_id + INNER JOIN concept AS c + ON po.procedure_concept_id = c.concept_id WHERE p.year_of_birth < 1980; ``` -::: {.callout-note} ## `WHERE` vs `ON` You will see variations of SQL statements that eliminate `JOIN` and `ON` entirely, putting everything in `WHERE`: @@ -263,13 +283,12 @@ SELECT po.person_id, c.concept_name ``` I'm not the biggest fan of this, because it is often not clear what is a filtering clause and what is a joining clause, so I prefer to use `JOIN`/`ON` with a `WHERE`. -::: ## Boolean Logic: `AND` versus `OR` -Revisiting `WHERE`, we can combine conditions with `AND` or `OR`. +Revisiting `WHERE`, we can combine conditions with `AND` or `OR`. -`AND` is always going to be more restrictive than `OR`, because our rows must meet two conditions. +`AND` is always going to be more restrictive than `OR`, because our rows must meet two conditions. ```{sql} #| connection: "con" @@ -279,7 +298,7 @@ SELECT COUNT(*) AND gender_source_value = 'M' ``` -On the other hand `OR` is more permissing than `AND`, because our rows must meet only one of the conditions. +On the other hand `OR` is more permissive than `AND`, because our rows must meet only one of the conditions. ```{sql} #| connection: "con" @@ -288,7 +307,8 @@ SELECT COUNT(*) WHERE year_of_birth < 1980 OR gender_source_value = 'M' ``` -There is also `NOT`, where one condition must be true, and the other must be false. + +There is also `NOT`, where one condition must be true, and the other must be false. 
```{sql} #| connection: "con" @@ -304,87 +324,76 @@ SELECT COUNT(*) ```{sql} #| connection: "con" -SELECT po.person_id, c.concept_name, po.procedure_date - FROM procedure_occurrence as po - INNER JOIN concept as c - ON po.procedure_concept_id = c.concept_id - ORDER BY po.procedure_date; +SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date + FROM person as p + INNER JOIN procedure_occurrence as po + ON p.person_id = po.person_id + ORDER BY p.person_id; ``` -We can `ORDER BY` multiple columns. Column order is important. Try changing the order of the columns in the query below. How is it different? +Once we sort by `person_id`, we see that for every unique `person_id`, there can be multiple procedures! This suggests that there is a **one-to-many relationship** between the `person` and `procedure_occurrence` tables. -```{sql} -#| connection: "con" -SELECT po.person_id, c.concept_name, po.procedure_date - FROM procedure_occurrence as po - INNER JOIN concept as c - ON po.procedure_concept_id = c.concept_id - ORDER BY po.person_id, po.procedure_date; -``` +## Try it Out -## Try it OUt - -Try ordering by `po.patient_id`: +We can `ORDER BY` multiple columns at once. Try ordering by `p.person_id` and `po.procedure_date`: ```{sql} #| connection: "con" -SELECT po.person_id, c.concept_name, po.procedure_date - FROM procedure_occurrence AS po - INNER JOIN concept AS c - ON po.procedure_concept_id = c.concept_id - ORDER BY po.procedure_date; +SELECT p.person_id, po.procedure_occurrence_id, po.procedure_date + FROM person as p + INNER JOIN procedure_occurrence as po + ON p.person_id = po.person_id + ORDER BY ----, ---- ``` ## Transactions and Inserting Data -So far, we've only queried data, but not added data to databases. +So far, we've only queried data, but not added data to databases. + +As we've stated before, DuckDB is an Analytical database, not a transactional one. That means it prioritizes reading from data tables rather than inserting into them. 
Transactional databases, on the other hand, can handle multiple inserts from multiple users at once. They are made for *concurrent* transactions. -As we've stated before, DuckDB is an Analytical database, not a transactional one. That means it prioritizes reading from data tables rather than inserting into them. Transactional databases, on the other hand, can handle multiple inserts from multiple users at once. They are made for *concurrent* transactions. - Here is an example of what is called the *Data Definition Language* for our tables: -```sql +``` sql CREATE TABLE @cdmDatabaseSchema.PERSON ( - person_id integer NOT NULL, - gender_concept_id integer NOT NULL, - year_of_birth integer NOT NULL, - month_of_birth integer NULL, - day_of_birth integer NULL, - birth_datetime TIMESTAMP NULL, - race_concept_id integer NOT NULL, - ethnicity_concept_id integer NOT NULL, - location_id integer NULL, - provider_id integer NULL, - care_site_id integer NULL, - person_source_value varchar(50) NULL, - gender_source_value varchar(50) NULL, - gender_source_concept_id integer NULL, - race_source_value varchar(50) NULL, - race_source_concept_id integer NULL, - ethnicity_source_value varchar(50) NULL, - ethnicity_source_concept_id integer NULL ); -``` - - -When we add rows into a database, we need to be aware of the *constraints* of the database. They exist to maintain the *integrity* of a database. - -We've encountered one constraint: database fields need to be *typed*. For example, id keys are usually `INTEGER`. Names are often `VARCHAR`. - -One contraint is the requirement for *unique keys* for each row. We cannot add a new row with a previous -key value. - -- `NOT NULL` -- `UNIQUE` -- `PRIMARY KEY` - `NOT NULL` + `UNIQUE` -- `FOREIGN KEY` - value must exist as a key in another table -- `CHECK` - check the data type and conditions. One example would be our data shouldn't be before 1900. -- `DEFAULT` - default values. 
+ person_id integer NOT NULL, + gender_concept_id integer NOT NULL, + year_of_birth integer NOT NULL, + month_of_birth integer NULL, + day_of_birth integer NULL, + birth_datetime TIMESTAMP NULL, + race_concept_id integer NOT NULL, + ethnicity_concept_id integer NOT NULL, + location_id integer NULL, + provider_id integer NULL, + care_site_id integer NULL, + person_source_value varchar(50) NULL, + gender_source_value varchar(50) NULL, + gender_source_concept_id integer NULL, + race_source_value varchar(50) NULL, + race_source_concept_id integer NULL, + ethnicity_source_value varchar(50) NULL, + ethnicity_source_concept_id integer NULL ); +``` + +When we add rows into a database, we need to be aware of the *constraints* of the database. They exist to maintain the *integrity* of a database. + +We've encountered one constraint: database fields need to be *typed*. For example, id keys are usually `INTEGER`. Names are often `VARCHAR`. + +Another constraint is the requirement for *unique keys* for each row. We cannot add a new row that reuses an existing key value. + +- `NOT NULL` +- `UNIQUE` +- `PRIMARY KEY` - `NOT NULL` + `UNIQUE` +- `FOREIGN KEY` - value must exist as a key in another table +- `CHECK` - check the data type and conditions. One example would be our dates shouldn't be before 1900. +- `DEFAULT` - default values. The most important ones to know about are `PRIMARY KEY` and `FOREIGN KEY`. `PRIMARY KEY` requires every row to have a unique, non-`NULL` key value; it does not by itself generate ids, although some engines can auto-increment them. When we create tables in our database, we need to specify which column is a `PRIMARY KEY`: -```sql +``` sql CREATE TABLE person ( person_id INTEGER PRIMARY KEY ) @@ -392,7 +401,7 @@ CREATE TABLE person ( `FOREIGN KEY` involves two or more tables. If a column is declared a `FOREIGN KEY`, then that key value must *exist* in the referenced table. Here the referenced table is `person`, and the referencing table is `procedure_occurrence`. 
-```sql +``` sql CREATE TABLE procedure_occurrence ( procedure_occurrence_id INTEGER PRIMARY KEY, person_id INTEGER REFERENCES person(person_id) @@ -408,7 +417,7 @@ You can see an example of constraints for our database here: +``` +a. For example, when is procedure information collected? +b. Do patients have multiple procedures? (Cardinality) -Of this, steps 1 and 2 are the most difficult and take the most time. They require the designer to interview users of the data and those who collect the data to reflect the *business processes*. These two steps are called the **Data Modeling** steps. +```{=html} + +``` +2. You need to group like data with like (normalization) -These processes are essential if you are designing a **transactional database** that is collecting data from multiple sources (such as clinicians at time of care) and is updated multiple times a second. For example, bank databases have a rigorous design. +```{=html} + +``` +a. Data that is dependent on a primary key should stay together +b. For example, `person` should contain information about a patient such as demographics, but not individual `procedure_concept_ids`. -If you want to read more about the data model we're using, I've written up a short bit here: [OMOP Data Model](miscellaneous.html#the-omop-data-model). +```{=html} + +``` +3. You need to have an automated process to add data to the database (Extract, Transform, Load, or ETL). +4. Search processes must be optimized for common operations (indexing) + +Of these, steps 1 and 2 are the most difficult and take the most time. They require the designer to interview users of the data and those who collect the data to reflect the *business processes*. These two steps are called the **Data Modeling** steps. +These processes are essential if you are designing a **transactional database** that is collecting data from multiple sources (such as clinicians at time of care) and is updated multiple times a second. For example, bank databases have a rigorous design.
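
As a rough illustration of steps 2 and 4 above, here is a minimal sketch in generic SQL. The table and column names follow the `person` and `procedure_occurrence` examples used earlier, trimmed to a few columns; the index name `idx_procedure_person` is made up for illustration, and the exact syntax may need small adjustments depending on your database engine.

``` sql
-- Step 2 (normalization): demographics live in person, one row per patient
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    year_of_birth INTEGER NOT NULL
);

-- Each procedure gets its own row, linked back to a patient by person_id,
-- so a patient with many procedures does not need many person rows
CREATE TABLE procedure_occurrence (
    procedure_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    procedure_concept_id INTEGER NOT NULL
);

-- Step 4 (indexing): speed up the common lookup
-- "find all procedures for a given patient"
CREATE INDEX idx_procedure_person ON procedure_occurrence (person_id);
```

Keeping procedures out of `person` means that demographic details are stored once per patient, while the index is one possible way to support a query pattern we expect to run often.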
+ +If you want to read more about the data model we're using, I've written up a short bit here: [OMOP Data Model](miscellaneous.html#the-omop-data-model). ## Database Administration Maintaining a database is also known as **database administration**. Database Admins are responsible for the following: -1. Making sure that the data maintains its integrity -2. Ensuring that common queries are optimized for fast loading -3. General upkeep and optimization. Oftentimes, if multiple people are accessing the data at once, the data may be distributed among multiple machines (load balancing). -4. Security. We don't want the wrong people accessing the data. +1. Making sure that the data maintains its integrity +2. Ensuring that common queries are optimized for fast loading +3. General upkeep and optimization. Oftentimes, if multiple people are accessing the data at once, the data may be distributed among multiple machines (load balancing). +4. Security. We don't want the wrong people accessing the data. -Being a good admin does not start from scratch. You can't be a top-tier admin straight out of school. There are a lot of things DB admins learn, but a lot of the optimization happens from experience with managing the data. +Being a good admin does not start from scratch. You can't be a top-tier admin straight out of school. There are a lot of things DB admins learn, but a lot of the optimization happens from experience with managing the data. -Respect your DB Admin and know that they know a lot about how to optimize your queries. +Respect your DB Admin and know that they know a lot about how to optimize your queries. ## Always close the connection -When we're done, it's best to close the connection with `dbDisconnect()`. +When we're done, it's best to close the connection with `dbDisconnect()`. ```{r} dbDisconnect(con)