nyc-fires/live_code.Rmd at master · aedobbyn/nyc-fires · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
---
title: "`drake` ~ live coding sesh ~"
author: "Amanda Dobbyn"
date: "February, 2019"
output: md_document
---

```{r setup, include=FALSE, warning=FALSE}
knitr::opts_chunk$set(eval = FALSE)
knitr::opts_knit$set(eval = FALSE)
```

```{r}
# Will not be necessary to set in the upcoming release
pkgconfig::set_config("drake::strings_in_dots" = "literals")

# Grab all funs
source(here::here("R", "didnt_start_it.R"))
```


### Burner account time

Let's define a file path to dump the tweets we collect into.

```{r}
burner_path <- here("data", "raw", "burn.csv") # Same as "<working_dir>/data/raw/burn.csv"

# Delete the file in burner_path if it exists already so we start from scratch
if (file_exists(burner_path)) file_delete(burner_path)
```


Now we'll create our first draft of the plan:

```{r}
burner_plan <-
  drake_plan(
    # Reads in file from burner_path if file exists, otherwise pulls in seed tweets
    seed_burn = get_tweets(
      user = burner_handle,
      input_path = burner_path
    ),

    full_burn = get_tweets(
        tbl = seed_burn,
        user = burner_handle,
        output_path = burner_path # Outputs to burner_path
      )
  )
```

Every time `burner_plan` is run, new any tweets collected will be stored in `burner_path` and be read into the next iteration of `seed_burn`.

### Quick note on files

We're not using `drake`'s [`file_in`](https://ropensci.github.io/drake/reference/file_in.html) and [`file_out`](https://ropenscilabs.github.io/drake-manual/vis.html#output-files) functions, which allow `drake` to recognize and track the file. That lets us see when the targets associated with the file (both upstream and downstream of it) are invalidated.

### Test out v1 of our plan

Since `seed_burn` hasn't changed and the code to generate the targets hasn't changed, every time after the first time we run this plan `drake` will let us know that our targets are up to date and not re-run anything.

```{r}
# Take a look at the dataframe that represents our plan
burner_plan

# Save the config in an object we can look at
burner_config <- drake_config(burner_plan)

# Remove targets if they were already built
clean()

# Both seed_burn and full_burn should be outdated
vis_drake_graph(burner_config)

# Make the plan
make(burner_plan, verbose = 4)

# Everything should already be up to date
outdated(burner_config)

# Which is reflected in our graph
vis_drake_graph(burner_config)

# Can show just targets if things are too crowded
vis_drake_graph(burner_config, targets_only = TRUE)
```

We can now load these targets into our working environment.

```{r}
# They won't be found if we don't loadd() them
seed_burn

loadd(seed_burn)
seed_burn
```


Every subsequent time we re-`make` the plan, `drake` should tell us we don't need to do anything.

```{r}
# Unloads (removes) the global variable seed burn we've loadd'd to prevent conflicts
make(burner_plan)
# Not in our env anymore
seed_burn

make(burner_plan)
make(burner_plan)
```

If we `clean()`, however, we'll remake the plan from scratch.

```{r}
clean()

# Targets we built are gone from the cache
loadd(seed_burn)

# Targets are outdated now
vis_drake_graph(burner_config, targets_only = TRUE)

# Make the plan again.
# Since we saved the previous output of full_burn to a file, this time we read that in for our seed_burn
make(burner_plan)

# And now everything is up to date again
vis_drake_graph(burner_config, targets_only = TRUE)
```


### Changing Code

One way we can guarantee that `drake` will re-`make` a target is if some part of the code used to generate that target changes.

Right now, nothing outdated becuase we just ran `make(burner_plan)`.

```{r}
outdated(burner_config)
```

Let's modify the code to a function called by `get_tweets`. Note that `drake` recognizes that targets that incorporate this function at some point in the pipeline are out of date, even though this function isn't called in the plan directly.

Our original function, `there_are_new_tweets` which checks if the specified user has tweeted anything since the most recent tweet in our `tbl` argument:

```{r}
there_are_new_tweets <- function(tbl,
                                 user = firewire_handle,
                                 verbose = TRUE) {
  latest_dt <-
    tbl %>%
    arrange(desc(created_at)) %>%
    slice(1) %>%
    pull(created_at)

  if (verbose) message("Searching for new tweets.")

  new <- get_seed_tweets(user = user, n_tweets = 1)

  if (max(new$created_at) <= latest_dt) {
    if (verbose) message("No new tweets to pull.")
    FALSE
  } else {
    TRUE
  }
}
```


We'll change the message to include a smiley emoji `emo::ji('smile')` when `there_are_new_tweets` is called.

```{r}
there_are_new_tweets <- function(tbl,
                                 user = firewire_handle,
                                 verbose = TRUE) {
  latest_dt <-
    tbl %>%
    arrange(desc(created_at)) %>%
    slice(1) %>%
    pull(created_at)

  if (verbose) message(glue("Searching for new tweets! {emo::ji('smile')}"))

  new <- get_seed_tweets(user = user, n_tweets = 1)

  if (max(new$created_at) <= latest_dt) {
    if (verbose) message("No new tweets to pull.")
    FALSE
  } else {
    TRUE
  }
}
```


Without doing anything else, we can check that both targets which had been up to date has now been invalidated.

```{r}
outdated(burner_config)
vis_drake_graph(burner_config)
```

Notice that we didn't even need to re-define our plan; `drake` can tell that a function a target in our plan relies on has meaningfully changed.

When we re-`make` our plan we should be messaged an `r emo::ji("smile")`.

```{r}
make(burner_plan)
```

And now `seed_burn` and `full_burn` should both be up to date again.

```{r}
outdated(burner_config)
vis_drake_graph(burner_config)
```


### Add Triggers

I mentioned that a way to guarantee that targets are re-made even if code *doesn't* change is to associate a trigger with a target.

Let's add a [trigger](https://ropensci.github.io/drake/articles/debug.html#test-with-triggers-) so that we always run `get_tweets` to look for new tweets at the burner handle.


```{r}
burner_plan_2 <-
  drake_plan(
    seed_burn = get_tweets(
      user = burner_handle,
      input_path = burner_path
    ),

    full_burn = target(
      command = get_tweets(
        tbl = seed_burn,
        user = burner_handle,
        output_path = burner_path
      ),
      trigger = trigger(
        condition = TRUE
      )
    )
  )
```


This trigger will make our `full_burn` target look always out of date to `drake` since it knows we need to re-make `full_burn` every time `make()` is run.

Our `condition` always evaluates to `TRUE` because it's just the value `TRUE`, but you can also sub in any expression that returns a boolean.

```{r}
is_even_day <- function(input_date = Sys.Date()) {
  this_day <-
    input_date %>% lubridate::day()

  this_day %% 2 == 0
}

is_even_day()
is_even_day(Sys.Date() + 1)
is_even_day(Sys.Date() + 2)
```


### Test out the `burner_plan` 2.0

First we'll clean out the previous file we saved and start from scratch.

```{r}
if (file_exists(burner_path)) file_delete(burner_path)
```


Our `full_burn` target will always be out of date because the trigger indicates that it always needs to be rebuilt, but `seed_burn` will now be up-to-date.

```{r, eval=FALSE}
clean()

# Notice our new trigger column
burner_plan_2

# seed_burn is outdated becuase we haven't made the plan yet.
# full_burn is always out of date, no matter what, because of the trigger.
burner_config_2 <- drake_config(burner_plan_2)
outdated(burner_config_2)
vis_drake_graph(burner_config_2)
```

All tweets are now stored in `burner_path`, which we specified as our `output_path` when making `full_burn`.

What's stored in this file is the same as both `seed_burn` and `full_burn` becuase there were no tweets posted between when we made `seed_burn` and `full_burn`.

```{r}
make(burner_plan_2)

# Load the results of seed_burn and full_burn into our environment
loadd(seed_burn)
loadd(full_burn)
# These should be the same since no new tweets posted
expect_identical(seed_burn, full_burn)

# full_burn always outdated, seed_burn no longer outdated
outdated(burner_config_2)
vis_drake_graph(burner_config_2, targets_only = TRUE)
```


Now let's post a new tweet and re-`make` the plan.

`seed_burn` will be read in from the file, which reflects the state of the world before this tweet. Then `full_burn` will incorporate the new tweet.

```{r}
# Tweet here
post_tweet(status =
             digest::digest(sample(100, 1)),
           token = firewire_token)


# seed_burn (read from file) is up to date, and full_burn will always look outdated
make(burner_plan_2)
vis_drake_graph(burner_config_2, targets_only = TRUE)

loadd(seed_burn)
loadd(full_burn)
# We should have one extra row in full_burn than in seed_burn, because seed_burn wasn't rebuilt
expect_gt(nrow(full_burn), nrow(seed_burn))

seed_burn
full_burn

# Let's check that full_burn was saved to file
saved_burn <- read_csv(burner_path)
expect_identical(nrow(full_burn), nrow(saved_burn))
```

Now we've proven that we can successfully trigger the re-building of `full_burn` every time and save it as the latest state of the world to a file.

Every time new tweets arrive, that file will be updated at the end of the `make()` run.


### Write our full plan

Let's go back to our original NYCFireWire Twitter account.

For the purposes of illustration, I'll set a `max_id` on our `seed_fires` so that we can re-up and grab more tweets to build `fires`.

```{r}
fire_path <- here("data", "raw", "fires.csv")
if (file_exists(fire_path)) file_delete(fire_path)
```

```{r}
plan <-
  drake_plan(
    seed_fires = get_tweets(
      n_tweets_seed = 2,
      max_id = old_tweet_id,
      input_path = fire_path
    ),
    fires = target(
      command = get_tweets(
        tbl = seed_fires,
        n_tweets_reup = 3,
        output_path = fire_path
      ),
      trigger = trigger(
        condition = TRUE # Always look for new tweets
      )
    ),
    addresses = pull_addresses(fires), # Extract addresses from tweets
    lat_long = get_lat_long(addresses), # Send to Google for lat-longs
    fire_sums = count_fires(lat_long), # Sum up n fires per lat-long combo

    time_graph = graph_fire_times(lat_long),
    plot = plot_fire_sums(fire_sums)
  )
```

### Run our plan

```{r}
plan

make(plan, verbose = 4)

# See what a couple of our targets look like
loadd(addresses)
addresses

loadd(lat_long)
lat_long

loadd(time_graph)
time_graph
```

### Info drake stores

Let's get a little deeper into what `drake` stores in the `config`.

```{r, eval=FALSE}
config <- drake_config(plan)
sort(names(config))
config$plan
config$targets
config$cache_path
config$graph

# The dependencies of a given target encompass both functions (get_tweets) and targets that it depends on (seed_fires)
deps_target(addresses)

# You can get even more granular with
dependency_profile(addresses, config)

# Since we have a trigger on fires, everything downstream of fires is also perpetually outdated
outdated(config)
vis_drake_graph(config)

# We can clean just one target instead of cleaning everything
clean(fires)

# Since we didn't clean() seed_fires (upstream of fires), this target remains built and our dependency graph looks the same
vis_drake_graph(config)

# And again, if we re-make the plan we'll pull in new tweets because of our trigger
make(plan)
```

### Previously `geocode`d

For kicks, let's make our plots with data from 3000 tweets pulled from NYCFireWire and sent to Google for geocoding.

We'll stop at `pull_addresses(fires)` so we don't re-geocode all 3k fires.

We'll set the `input_path` to `seed_fires` to that file, and still grab the most recent 3 tweets.

```{r}
big_plan <-
  drake_plan(
    seed_fires = get_tweets( # Grab seed fires from file
      input_path = here("data", "raw", "lots_o_fires.csv")
    ),
    fires = target(
      command = get_tweets(
        tbl = seed_fires,
        n_tweets_reup = 3
      ),
      trigger = trigger(
        condition = TRUE
      )
    ),
    addresses = pull_addresses(fires)
    # Not sending all 3k addresses to Google
  )

make(big_plan)

loadd(addresses)
nrow(addresses)
addresses
```

Make our quick plot from the 3k fires:

```{r}
lat_long <-
  read_csv(here("data", "derived", "lat_long.csv"))

fire_sums <-
  read_csv(here("data", "derived", "fire_sums.csv"))

graph_fire_times(lat_long)
plot_fire_sums(fire_sums)
```


### That's all!

Any q's?