betydata R data package with BETYdb public data export#12

Open
divine7022 wants to merge 44 commits into main from mvp-betydata

@divine7022
Collaborator

Summary

Initial release of betydata, an R data package providing offline access to
public data from BETYdb

  • 16 datasets: traitsview (43,532 rows) + 15 reference tables
  • Multiple formats: .rda (lazy-loaded), Parquet, Frictionless datapackage.json
  • Filtered to public data only (access_level = 4, checked >= 0)
  • Complete roxygen2 documentation for all datasets
  • Package-level documentation with BETYdb context
  • Data quality policy in README (checked column, access levels)
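
The public-data filter described above can be sketched roughly as follows. This is a minimal illustration, not the package's build script: `raw_export` and its values are made up, and only the two conditions (`access_level = 4`, `checked >= 0`) come from the summary.

```r
library(dplyr)

# Toy stand-in for a raw BETYdb export; the rows are invented for illustration.
raw_export <- tibble(
  id           = 1:5,
  access_level = c(4L, 4L, 2L, 4L, 1L),
  checked      = c(1L, 0L, 1L, -1L, 0L)
)

# Keep public records (access_level == 4) that did not fail QA/QC (checked >= 0).
public_records <- raw_export |>
  filter(access_level == 4, checked >= 0)

public_records
```

Only the first two toy rows survive: row 3 and row 5 are not public, and row 4 is flagged with `checked = -1`.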

Vignettes

  • orientation: Package overview and data relationships
  • sql-analogs: Migrate BETYdb SQL queries to dplyr
  • pfts-priors: Working with PFTs and Bayesian priors
  • manuscript: Reproduce LeBauer et al. (2018) analyses

Datasets

| Dataset | Description |
|---------|-------------|
| `traitsview` | Primary trait/yield observations (43,532 × 36) |
| `species` | Plant taxonomy |
| `sites` | Research site locations |
| `variables` | Trait definitions and units |
| `citations` | Literature references |
| `pfts` | Plant functional types |
| `priors` | Bayesian prior distributions |
| + 9 more | Support and relationship tables |

implements #1, #2, #3, #4, #5, #6, #7, #8, #9, #10, #11

@divine7022 requested a review from dlebauer February 11, 2026 21:13

Copilot AI left a comment

Pull request overview

This PR delivers the initial release (v0.1.0) of betydata, an R data package providing offline access to public data from the BETYdb (Biofuel Ecophysiological Traits and Yields) database. The package enables reproducible analyses of plant traits and crop yields without requiring database connectivity.

Changes:

  • Complete R package structure with 16 datasets (traitsview + 15 support tables) totaling 43,532+ trait and yield records
  • Multiple data formats: lazy-loaded .rda files, Parquet alternatives, and Frictionless metadata (datapackage.json)
  • Comprehensive documentation: roxygen2 docs for all datasets, 4 vignettes (orientation, sql-analogs, pfts-priors, manuscript), and GitHub issue templates
  • Quality controls: excludes checked=-1 records, public data only (access_level >= 4), full test coverage
  • CI/CD infrastructure: GitHub Actions R-CMD-check workflow, testthat 3.0 test suite

Reviewed changes

Copilot reviewed 38 out of 71 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| DESCRIPTION | Package metadata and dependencies; minor email format issue |
| CITATION.cff | Citation metadata; email and missing preferred-citation issues |
| LICENSE | BSD-3-Clause license file |
| README.md | Comprehensive package documentation; table formatting issue |
| NEWS.md | Release notes documenting v0.1.0 |
| R/betydata-package.R | Package-level documentation |
| R/data.R | Roxygen2 documentation for all 16 datasets |
| man/*.Rd | Generated documentation files for datasets |
| vignettes/*.Rmd | Four tutorial vignettes; minor issues in manuscript.Rmd and pfts-priors.Rmd |
| tests/testthat/*.R | Test suite for data and metadata validation; deprecated context() calls |
| data-raw/make-data.R | Data build script for generating .rda and Parquet files |
| inst/metadata/datapackage.json | Frictionless Data package metadata |
| inst/extdata/parquet/*.parquet | Sample Parquet data files |
| data/*.rda | Binary R data files (compressed with xz) |
| .github/workflows/*.yaml | GitHub Actions CI configuration |
| .github/ISSUE_TEMPLATE/*.md | Issue templates for data corrections and verifications |
| .gitignore, .Rbuildignore | Build and version control configuration; CSV exclusion concern |

Comments suppressed due to low confidence (2)

tests/testthat/test-metadata.R:3

  • The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed as they are no longer needed and will generate warnings.

tests/testthat/test-data.R:3

  • The context() function on line 3 is deprecated in testthat 3.0.0 and later. According to the DESCRIPTION file, this package uses testthat (>= 3.0.0) and has Config/testthat/edition: 3. The context() calls should be removed as they are no longer needed and will generate warnings.
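
For reference, a test file written in the edition 3 style simply starts with `test_that()` blocks and has no `context()` call. The sketch below uses a hypothetical `example_table` rather than any real package table, so the column names are assumptions.

```r
library(testthat)

# Hypothetical stand-in for a package table; not the real traitsview.
example_table <- data.frame(
  trait = c("SLA", "Vcmax"),
  mean  = c(15.2, 60.1)
)

# Edition 3 style: no context() call; the test file name labels the group.
test_that("example_table has the expected columns", {
  expect_named(example_table, c("trait", "mean"))
  expect_type(example_table$mean, "double")
})
```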


1. betydata excludes `checked = -1` (failed QA/QC records)
2. Snapshot date: betydata was exported on `r format(Sys.Date(), "%Y-%m-%d")`; the manuscript used 2017 data
3. Access level filtering: betydata includes only public data (`access_level < 4`)

Copilot AI Feb 19, 2026

The access level comparison note contains an error. The text states "access_level < 4" but according to the README and elsewhere in the code, the package includes only public data where "access_level >= 4" (not less than 4). This is the opposite condition and needs to be corrected to ">= 4".

Collaborator Author

good catch, that is a typo

README.md Outdated
Comment on lines +31 to +46
| Dataset | Rows | Columns | Description |
|---------------|--------|---------|----------------------------------------------|
| `traitsview` | 43,532 | 36 | Denormalized view of plant traits and yields |
| Dataset | Description |
|---------------|---------------------------------------------------------------|
| `species` | Plant taxonomy (genus, species, common names) |
| `sites` | Research site locations with coordinates and climate data |
| `variables` | Trait/variable definitions, units, and valid ranges |
| `citations` | Literature references (author, year, title, DOI) |
| `cultivars` | Plant cultivar and variety information |
| `treatments` | Experimental treatment definitions |
| `managements` | Management events (planting, harvest, fertilization) |
| `methods` | Measurement method descriptions |
| `pfts` | Plant Functional Type definitions for ecological modeling |
| `priors` | Prior probability distributions for Bayesian analysis |
| `entities` | Entity identifiers for repeated measures |

Copilot AI Feb 19, 2026

The README contains a malformed table structure. Lines 31-33 show a table header for the Primary Dataset, but then lines 34-46 continue with a different table that has incompatible headers (missing "Rows" and "Columns" columns). This creates a broken table rendering. The support tables section should have its own separate table header.

Collaborator Author

fixed. The README tables are now split into three properly separated sections with their own headers: ### Primary Table, ### Metadata Tables, and ### Relationship Tables. Each has its own markdown table with consistent column structure

Comment on lines +200 to +220
if (length(sla_data) > 10 && exists("x") && exists("y")) {
  # Create plot comparing prior to histogram of data
  ggplot() +
    geom_histogram(
      data = data.frame(sla = sla_data),
      aes(x = sla, y = after_stat(density)),
      bins = 30, fill = "steelblue", alpha = 0.6
    ) +
    geom_line(
      data = data.frame(x = x, y = y),
      aes(x, y),
      color = "red", linewidth = 1, linetype = "dashed"
    ) +
    labs(
      title = "SLA: Prior Distribution vs. Observed Data",
      subtitle = "Red dashed = prior, Blue = observed data (Miscanthus + Panicum)",
      x = "SLA (m2/kg)",
      y = "Density"
    ) +
    xlim(0, 80)
}

Copilot AI Feb 19, 2026

The code at line 200 checks for exists("x") && exists("y") but these variables (x and y) are created within a previous code chunk that only executes conditionally (if (nrow(sla_priors) > 0)). This creates a fragile dependency where the plot will only render if both the SLA priors exist AND the earlier chunk successfully created x and y variables. This code should either store x and y in a way that persists across chunks or restructure the logic to avoid this cross-chunk dependency.

Collaborator Author

Good find, and the issue was actually worse than noted: standalone variables x and y were never created anywhere in the vignette. The prior chunk stored results but never assigned them to x and y.
Fixed by explicitly creating prior_x and prior_y variables in the prior visualization chunk.
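
A rough sketch of what that fix might look like is below. The gamma prior, its parameters, and the simulated `sla_data` are assumptions made only so the example runs; the vignette's actual prior comes from the priors table and is not shown in this thread.

```r
library(ggplot2)

# Simulated stand-in for observed SLA values (assumption, not package data).
set.seed(42)
sla_data <- rgamma(200, shape = 6, rate = 0.3)

# Create the prior curve explicitly in the same chunk that plots it,
# so the plot never depends on variables from another chunk.
# The gamma distribution and its parameters are hypothetical.
prior_x <- seq(0, 80, length.out = 200)
prior_y <- dgamma(prior_x, shape = 5, rate = 0.25)

if (length(sla_data) > 10) {
  p <- ggplot() +
    geom_histogram(
      data = data.frame(sla = sla_data),
      aes(x = sla, y = after_stat(density)),
      bins = 30, fill = "steelblue", alpha = 0.6
    ) +
    geom_line(
      data = data.frame(x = prior_x, y = prior_y),
      aes(x, y),
      color = "red", linewidth = 1, linetype = "dashed"
    ) +
    labs(x = "SLA (m2/kg)", y = "Density")
  print(p)
}
```

Because `prior_x` and `prior_y` are defined right before use, the `exists()` checks from the original chunk are no longer needed.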

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Member

@dlebauer left a comment

I've done a quick first review. On a future review I will go through all of the vignettes and explore the tables as they exist.

I am now wondering if we should 1) store the data in CSV files to allow text-based version control and 2) if we can reconstruct traitsview on the fly from the component datasets (i.e. traitsview should not be in data_raw)

"path": "https://doi.org/10.1111/gcbb.12420"
}
],
"resources": [

Member

Here it looks like only the traits view dataset has enumerated fields - is that intentional?

Collaborator Author

This was a starting point, since traitsview is the primary table; leaving the other resources without fields was not intentional.
All 16 resources now have complete schema definitions in datapackage.json.
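
One way to keep every resource's schema complete is to derive the field list from the tables themselves. The sketch below is an illustration only: `frictionless_type()` and `table_schema()` are hypothetical helpers, the two toy tables stand in for the real ones, and the type mapping covers just the basic Frictionless field types.

```r
library(jsonlite)

# Map an R column to a Frictionless Table Schema field type (hypothetical helper).
frictionless_type <- function(column) {
  if (inherits(column, "Date")) return("date")
  if (is.logical(column)) return("boolean")
  if (is.integer(column)) return("integer")
  if (is.numeric(column)) return("number")
  "string"
}

# Build a schema (list of name/type fields) for one table (hypothetical helper).
table_schema <- function(tbl) {
  list(fields = lapply(names(tbl), function(nm) {
    list(name = nm, type = frictionless_type(tbl[[nm]]))
  }))
}

# Toy tables standing in for the package's tables.
tables <- list(
  species = data.frame(id = 1:2, genus = c("Miscanthus", "Panicum")),
  sites   = data.frame(id = 1:2, lat = c(40.1, 35.9))
)

# One resource entry per table, each with an enumerated schema.
resources <- lapply(names(tables), function(nm) {
  list(name = nm, schema = table_schema(tables[[nm]]))
})

cat(toJSON(list(resources = resources), auto_unbox = TRUE, pretty = TRUE))
```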


## Available Datasets

The package exports 16 datasets. List them all:

Member

more correct to call betydata a dataset with multiple tables, rather than referring to each table as a 'dataset'

Collaborator Author

Agreed! Changed throughout: README, vignettes, and docs now refer to "a dataset with 16 tables" rather than "16 datasets." Each exported entity is a table; the package as a whole is the dataset.

names(traitsview)
```

### Key Columns

Member

I propose that we re-organize traitsview a bit so that the key cols are first, and the ids are all to the right. The goal is to make it easier on end users.

Collaborator Author

done 👍


```{r basic-exploration}
# Preview
head(traitsview[, c("trait", "mean", "units", "scientificname", "author")])

Member

If we put the key cols first and use tibbles, then the preview could simply be:

traitsview

Collaborator Author

exactly right! all vignettes now use just traitsview for preview -- tibbles print 10 rows by default;
Removed all head(traitsview[, c(...)]) patterns
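
As a small illustration of the two changes discussed here (key columns first, tibble printing for previews), the sketch below uses a toy table. The column set is an assumption borrowed from the preview call above, not the real 35-column layout.

```r
library(dplyr)

# Toy table with id columns first, as in a raw export (values invented).
toy_traitsview <- tibble(
  id             = 1:3,
  site_id        = c(10L, 11L, 10L),
  trait          = c("SLA", "SLA", "Vcmax"),
  mean           = c(15.2, 18.4, 60.1),
  units          = c("m2 kg-1", "m2 kg-1", "umol m-2 s-1"),
  scientificname = c("Miscanthus x giganteus", "Panicum virgatum",
                     "Panicum virgatum")
)

# Move the key columns to the front; the id columns end up on the right.
reordered <- toy_traitsview |>
  relocate(trait, mean, units, scientificname)

# A tibble prints a compact preview on its own, so no head() is needed.
reordered
```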

table(traitsview$checked, useNA = "ifany")

# Work with verified records only
verified <- traitsview[traitsview$checked == 1, ]

Member

Let's consistently use dplyr verbs. They are easier to read:

verified <- traitsview |> 
    filter(checked == 1)

README.md Outdated

### Relationship Tables

| Dataset | Description |

Member

Suggested change
- | Dataset | Description |
+ | Table | Description |

| `priors` | Prior probability distributions for Bayesian analysis |
| `entities` | Entity identifiers for repeated measures |

### Relationship Tables

Member

Briefly explain - how are these used?

Collaborator Author

added explanation under ### Relationship Tables

README.md Outdated
library(dplyr)
traitsview |> count(trait, sort = TRUE)

# Count by genus (top bioenergy crops)

Member

these won't be limited to bioenergy crops since they are not filtered

Collaborator Author

correct, added an actual bioenergy genera filter to Quick Start
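
A genus filter of that kind could look roughly like this. The toy data and the `bioenergy_genera` vector are assumptions: Miscanthus and Panicum appear elsewhere in this PR, while the other genera are illustrative guesses rather than the list used in the README.

```r
library(dplyr)

# Toy records; the real traitsview has many more genera and rows.
toy_traitsview <- tibble(
  genus = c("Miscanthus", "Panicum", "Zea", "Miscanthus",
            "Quercus", "Panicum", "Miscanthus")
)

# Hypothetical list of bioenergy genera (assumption, not from the README).
bioenergy_genera <- c("Miscanthus", "Panicum", "Populus", "Salix")

# Restrict to bioenergy genera before counting, so the label is accurate.
bioenergy_counts <- toy_traitsview |>
  filter(genus %in% bioenergy_genera) |>
  count(genus, sort = TRUE)

bioenergy_counts
```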

README.md Outdated
traitsview |> count(trait, sort = TRUE)

# Count by genus (top bioenergy crops)
traitsview |> count(genus, sort = TRUE) |> head(10)

Member

I prefer to have a new line after each |>

Suggested change
- traitsview |> count(genus, sort = TRUE) |> head(10)
+ traitsview |>
+   count(genus, sort = TRUE)

And then rely on the default printing behavior of tibbles to summarize the tables.

Collaborator Author

applied throughout, all pipe chains use multi-line format. Removed all head() calls -- tibbles summarize to 10 rows by default

README.md Outdated

**Note:** This package exports only `checked >= 0` data. Flagged records (`checked = -1`) are excluded during data preparation. For research requiring unchecked data, access the BETYdb PostgreSQL database directly.

### Access Levels

Member

I think we can remove the access_level columns and all references to the 'access_level' other than to say once that this package includes all public data from BETYdb

Collaborator Author

done
changes:

  1. make-data.R: traitsview$access_level <- NULL -- column dropped during build
  2. traitsview now has 35 columns (was 36)
  3. README: removed entire "Access Levels" section. Single sentence: "All data in this package is public (from BETYdb records with access_level = 4)."
  4. tests removed access_level test, updated column count to 35
  5. Vignettes: all access_level references removed
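
The build-time drop and the matching test can be sketched as below. The toy table has four columns rather than the real 36, so the expected count here is 3, not 35; everything except the `traitsview$access_level <- NULL` pattern is an assumption for illustration.

```r
library(testthat)

# Toy stand-in for traitsview before the build step (values invented).
toy_traitsview <- data.frame(
  trait        = c("SLA", "Vcmax"),
  mean         = c(15.2, 60.1),
  checked      = c(1L, 0L),
  access_level = c(4L, 4L)
)

# Same pattern as in make-data.R: drop the column during the build.
toy_traitsview$access_level <- NULL

# The test then asserts the column is gone and the count went down by one.
test_that("access_level is not shipped", {
  expect_false("access_level" %in% names(toy_traitsview))
  expect_equal(ncol(toy_traitsview), 3L)
})
```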

@divine7022 divine7022 requested a review from dlebauer March 2, 2026 02:01
…re-install the package into the standard library before R CMD check runs
…y skip code execution if the package is unavailable (e.g. someone running R CMD check locally without quarto's library path fix), instead of crashing with an error
@divine7022
Collaborator Author

> 1. store the data in CSV files to allow text-based version control

For v0.1.0, I kept the conventional approach: CSVs in data-raw/csv/ (build source, gitignored) and .rda in data/ (shipped format). This matches R data package conventions and keeps the repo size manageable.

For text based change tracking, one option is to start version controlling data-raw/csv/ (remove from .gitignore). This would give diffable change visibility without breaking R package conventions. The .rda files would still be the shipped format for lazydata

happy to implement this if you prefer it for v0.2.0
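
For context, the CSV-to-.rda flow described above can be sketched as follows. File locations use `tempdir()` so the example is self-contained; the real paths (data-raw/csv/ and data/) and the contents of make-data.R are not reproduced here.

```r
# Write a toy CSV that plays the role of a diffable source file.
csv_path <- file.path(tempdir(), "species.csv")
write.csv(
  data.frame(id = 1:2, genus = c("Miscanthus", "Panicum")),
  csv_path,
  row.names = FALSE
)

# Build step: read the text source and save the shipped binary with xz compression.
species <- read.csv(csv_path, stringsAsFactors = FALSE)
rda_path <- file.path(tempdir(), "species.rda")
save(species, file = rda_path, compress = "xz")

# Round-trip check: the .rda holds the same table as the CSV source.
loaded <- new.env()
load(rda_path, envir = loaded)
identical(loaded$species, species)
```

With this split, only the CSV would need to be tracked for readable diffs, while the .rda remains a build product.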

> if we can reconstruct traitsview on the fly from the component datasets

currently no -- the core trait/yield records (mean, n, stat, checked, etc.) exist only within the denormalized traitsview.csv. The support tables (species, sites, citations...) are reference/lookup tables but don't contain the actual measurements themselves

To reconstruct traitsview on the fly, we would need to also export the raw traits and yields tables from BETYdb (with their foreign keys), then join them to the dimension tables in R.

For now, shipping the pre-built traitsview is the most practical approach. If we want to move toward a normalized structure in a future version, we could export traits and yields as separate tables and add a helper function to join them. Could be a good goal for v0.2.0.
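
A helper of the kind mentioned above might look roughly like this. `build_traitsview()` is hypothetical, the toy tables and their key names (`species_id`, `site_id`) are assumptions, and a real version would also need the yields table and the remaining dimension tables.

```r
library(dplyr)

# Toy normalized tables (invented values and key names).
traits <- tibble(
  id         = 1:3,
  species_id = c(1L, 2L, 1L),
  site_id    = c(10L, 10L, 11L),
  trait      = c("SLA", "SLA", "Vcmax"),
  mean       = c(15.2, 18.4, 60.1)
)
species <- tibble(
  id             = 1:2,
  scientificname = c("Miscanthus x giganteus", "Panicum virgatum")
)
sites <- tibble(id = 10:11, sitename = c("Site A", "Site B"))

# Hypothetical helper: join measurements to their lookup tables by foreign key.
build_traitsview <- function(traits, species, sites) {
  traits |>
    left_join(species, by = c("species_id" = "id")) |>
    left_join(sites, by = c("site_id" = "id"))
}

rebuilt <- build_traitsview(traits, species, sites)
rebuilt
```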

@divine7022
Collaborator Author

Heads-up: https://github.com/PecanProject/betydata/actions/runs/22558371999/job/65339932874?pr=12

A note on Windows CI: the Windows R CMD check was failing with "there is no package called 'betydata'" during vignette rendering. This is a known quarto vignette engine issue (tracked in quarto-dev issue #217): quarto spawns a separate R subprocess that doesn't inherit the temporary library path used during R CMD check, so library(betydata) fails in that subprocess.

Workaround applied:

  1. added local::. to extra-packages in R-CMD-check.yaml to pre-install the package before check runs
  2. added requireNamespace("betydata") eval guard in each vignette as a fallback
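
The eval guard from item 2 could be written along these lines in a vignette's setup chunk. This is a sketch assuming the knitr engine is used for the vignettes; the variable name `has_betydata` is made up for the example.

```r
# Check whether the package can be loaded in this R process.
has_betydata <- requireNamespace("betydata", quietly = TRUE)

# Turn code evaluation off for all later chunks when the package is missing,
# so the vignette renders without running code instead of stopping with an error.
knitr::opts_chunk$set(eval = has_betydata)
```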

3 participants