statistical_notes/complex_reshaping_example.qmd at main · clayford/statistical_notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
---
title: "Complicated reshaping example"
author: "Clay Ford"
date: "February 15, 2024"
format:
  html:
    embed-resources: true
---

## The task

The data set in question had six different sets of variables that needed to be reshaped into "long" format.  Each set was collected over five waves.

```{r echo=FALSE}
library(haven)
d <- read_sav("SupportandERI.sav")
# reorder columns
vars <- c("MEI_affirm", "w2MEI_affirm", "w3MEI_affirm", "w4MEI_affirm",
          "w5MEI_affirm", "MEI_explor", "w2MEI_explor", "w3MEI_explor",
          "w4MEI_explor", "w5MEI_explor")
d <- d[,c(names(d)[1:6], vars, names(d)[17:36])]
names(d)[c(7, 12)] <- c("w1MEI_affirm", "w1MEI_explor")
```

For example, here's one set:

```{r}
d[1:2,7:11]
```

There are five more sets like that in the data. For each I needed to reshape such that there was a column for time and a column identifying the type of measure. Something like this...

```
time    affirm
1       3
2       2
3       2.33
4       3
5       3
1       4.33
2       4
...
```

## The original approach

My first attempt was pretty verbose but it worked. You can tell my mastery of `pivot_longer()` did not exceed much beyond the basics. Yes, I knew how to use the `names_transform` argument to extract the numbers from the column labels, but that was about it.

```{r}
library(tidyr)

# reshape affirm
d1 <- pivot_longer(d[,1:11], -c(1:6), names_to = "time",
                   values_to = "affirm",
                   names_transform = readr::parse_number)

# reshape explor
d2 <- pivot_longer(d[,c(1, 12:16)], -1, names_to = "time",
                   values_to = "explor",
                   names_transform = readr::parse_number)


# reshape appSUM
d3 <- pivot_longer(d[,c(1, 17:21)], -1, names_to = "time",
                   values_to = "appSUM",
                   names_transform = readr::parse_number)

# reshape emoSUM
d4 <- pivot_longer(d[,c(1, 22:26)], -1, names_to = "time",
                   values_to = "emoSUM",
                   names_transform = readr::parse_number)

# reshape infoSUM
d5 <- pivot_longer(d[,c(1, 27:31)], -1, names_to = "time",
                   values_to = "infoSUM",
                   names_transform = readr::parse_number)

# reshape racematch
d6 <- pivot_longer(d[,c(1, 32:36)], -1, names_to = "time",
                   values_to = "racematch",
                   names_transform = readr::parse_number)

# merge reshaped dataframes into one
d_long1 <- merge(d1, d2, by = c("P_ID", "time")) |>
  merge(d3, by = c("P_ID", "time")) |>
  merge(d4, by = c("P_ID", "time")) |>
  merge(d5, by = c("P_ID", "time")) |>
  merge(d6, by = c("P_ID", "time"))

head(d_long1, n = 5)
```

This did the trick and I was able to move on, but it didn't sit right with me. I thought I remembered Jacob demonstrating in [his data wrangling workshop](https://virginia.box.com/s/u3ojzf0c0xuxe13yohk8noi4di8zb4ri) how to reshape datasets similar to this using just _one call_ to `pivot_longer()`. So today I decided to revisit this code and see what I could do with it.

## The elegant approach

After some fiddling around with the `pivot_longer()` function and reading the {tidyr} "Pivoting" vignette, I was finally able to implement this in a much more elegant fashion. (I should have referred to Jacob's workshop materials, but I didn't have them handy and decided to see if I could figure it out on my own. Of course, once I did it, I then had to go download his materials and check if this approach is what he demonstrated, and it is!)

```{r}
d_long2 <- pivot_longer(d, cols = !c(1:6),
                       names_to = c("time", ".value"),
                       names_pattern = "(w[1-5])(.+)",
                       names_transform = list(time = readr::parse_number))
head(d_long2, n = 5)
```

This is for my benefit so I remember how this works:

- `cols = !c(1:6)` says to reshape all columns but the first six
- `names_to = c("time", ".value")` uses the keyword `.value` which "indicates that the corresponding component of the column name defines the name of the output column containing the cell values" (from the help page).
- `names_pattern = "(w[1-5])(.+)"` defines the pattern of the column names: "the letter w followed by a number 1-5, `(w[1-5])`, and then everything else, `(.+)`.
- `names_transform = list(time = readr::parse_number)` says to transform the time column to a number using the `parse_number` function.

In the grand scheme, I'm not sure this makes a difference. Most people don't care about data wrangling but rather the analysis. Although my first attempt was inefficient, it was easy for me to verify that each set of columns was successfully reshaped. I don't know if I'll ever get to a place where it becomes routine for me to reshape data like this casually using one call to `pivot_longer()`. But I thought it might be of interest to see what `pivot_longer()` is capable of.