Extract or create metadata on publications and link to datasets. Results get exported to RCPublications.
We are currently prioritizing research publications that describe use cases for datasets, since that is what our intended recommendations need to focus on. For example, a use case for NOAA datasets is coastal flooding and what the results mean for municipalities, states, etc. The publications we add should lean toward how data are used for policy and decision making.
Clone https://github.com/NYU-CI/RCDatasets.
Metadata, primarily linkages between datasets and publications, will come from our partners and clients. We want to capture dataset metadata and publication metadata, including linkages to the datasets that we are enumerating in `datasets.json`.
- In this repo, in `/metadata`, create a subfolder for the drop you are working with, and give it a name that reflects what's in it, e.g. `20190913_usda_excel` is named with the date USDA sent it, the data provider (usda) and the format.
- As you sift through the linkages, add to `datasets.json` any datasets that appear in a publication but are not listed yet. When adding an entry to `datasets.json`, create a new branch from https://github.com/NYU-CI/RCDatasets. It may be helpful to give the branch the same name as your subfolder in `/metadata`.
At a minimum, each record in the `datasets.json` file must have these required fields:
- `provider` -- name of the data provider
- `title` -- name of the dataset
- `id` -- a unique sequential identifier
For the names, use what the data provider shows on their web page and try to be as concise as possible.
When adding records:
- add to the bottom of the file
- increment the `id` number manually
- make sure not to introduce multiple names for the same provider
- make sure to remove any special characters or characters that will raise encoding errors
- all values should be string values (e.g., if you see any dictionaries, those should be removed)
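The rules above can be sketched as a small helper. This is only an illustration under assumptions: the `dataset-NNN` id format is inferred from the example entry, folding text to plain ASCII is one possible reading of "remove special characters", and matching provider names case-insensitively is a guess at how duplicate spellings might be caught. The function names are hypothetical and not part of the repo.

```python
import re
import unicodedata


def next_dataset_id(records):
    """Return the next sequential id, assuming ids look like 'dataset-058'."""
    numbers = [int(r["id"].split("-")[1]) for r in records]
    return f"dataset-{max(numbers, default=0) + 1:03d}"


def clean_text(value):
    """Fold text to plain ASCII and collapse whitespace (assumed cleanup rule)."""
    folded = unicodedata.normalize("NFKD", value)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()


def append_record(records, provider, title):
    """Append a new record at the bottom of the list and return it."""
    provider = clean_text(provider)
    # Reuse an existing spelling when the provider differs only by case,
    # so the same provider does not end up with multiple names.
    known = {r["provider"].lower(): r["provider"] for r in records}
    provider = known.get(provider.lower(), provider)
    record = {
        "id": next_dataset_id(records),
        "provider": provider,
        "title": clean_text(title),
    }
    records.append(record)
    return record
```

The ASCII folding is lossy, so review the cleaned names by hand before committing them.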
Other fields that may be included:
- `alt_title` -- list of alternative titles or abbreviations, aka "mentions"
- `url` -- URL for the main page describing the dataset
- `doi` -- a unique persistent identifier assigned by the data provider
- `alt_ids` -- other unique identifiers (alternative DOIs, etc.)
- `description` -- a brief (tweet sized) text description of the dataset
- `date` -- date of publication, which may help resolve conflicting identifiers
Example entry:
{
  "id": "dataset-058",
  "provider": "Bureau of Labor Statistics",
  "title": "Consumer Price Index",
  "alt_title": [
    "HEI"
  ],
  "url": "https://www.bls.gov/cpi/",
  "description": "The Consumer Price Index (CPI) is a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services."
}
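A quick check like the following can catch records that miss a required field or carry a nested dictionary. It is a minimal sketch, not a tool from the repo: `check_dataset_record` is a hypothetical name, and accepting lists of strings is an assumption based on the `alt_title` field in the example entry.

```python
REQUIRED_FIELDS = ("provider", "title", "id")


def check_dataset_record(record):
    """Return a list of problems found in one datasets.json record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")
    for key, value in record.items():
        if isinstance(value, str):
            continue
        # Lists of strings are accepted, as in the alt_title example.
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        problems.append(f"field {key} is not a string or list of strings")
    return problems
```

An empty list means the record passed; anything else describes what to fix before committing.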
- Create a CSV file in which you'll document the publication metadata (title, url, doi, etc.). Be sure to keep track of the linkages with the `dataset_id` that you just created. Ultimately you will export the publication metadata to a JSON file; name that according to the data drop as well.
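One way the CSV-to-JSON export might look is sketched below. The column names (`title`, `url`, `doi`, `dataset_id`) and the one-row-per-linkage layout are assumptions about how you keep your CSV, and `publications_from_csv` is a hypothetical helper; adjust it to your own file.

```python
import csv
import io
import json


def publications_from_csv(csv_file):
    """Group CSV rows by publication URL and collect the linked dataset ids."""
    publications = {}
    for row in csv.DictReader(csv_file):
        pub = publications.setdefault(
            row["url"],
            {"title": row["title"], "url": row["url"], "related_dataset": []},
        )
        # doi is optional, so only keep it when the cell is filled in.
        if row.get("doi"):
            pub["doi"] = row["doi"]
        link = {"dataset_id": row["dataset_id"]}
        if link not in pub["related_dataset"]:
            pub["related_dataset"].append(link)
    return list(publications.values())


# Made-up sample row standing in for a real CSV file.
sample = io.StringIO(
    "title,url,doi,dataset_id\n"
    "Example Report,https://example.org/report,,dataset-026\n"
)
print(json.dumps(publications_from_csv(sample), indent=2))
```

With a real drop, open the CSV with `open(path, newline="", encoding="utf-8")` and write the result with `json.dump` to the publications file named after the drop.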
At a minimum, each record in the `<your_unique_name>_publications.json` file must have these required fields:
- `title` -- name of the publication
- `url` -- URL for the main page describing the publication
- `related_dataset` -- `dataset_id` from `datasets.json`
Other fields that may be included:
- `doi` -- a unique persistent identifier assigned by the data provider
- `title` -- name of the dataset
- `id` -- a unique sequential identifier
Example entry in <your_unique_name>_publications.json:
{
  "title": "Design Issues in USDA's Supplemental Nutrition Assistance Program: Looking Ahead by Looking Back",
  "url": "https://www.ers.usda.gov/webdocs/publications/86924/err-243.pdf?v=43124",
  "related_dataset": [
    {"dataset_id": "dataset-026"}
  ]
}
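Before exporting, it may help to confirm that every `dataset_id` in the publications file actually exists in `datasets.json`. The sketch below assumes both files have already been loaded into lists of dictionaries shaped like the example entries; `unknown_dataset_links` is a hypothetical name.

```python
def unknown_dataset_links(publications, datasets):
    """Return (publication title, dataset_id) pairs with no match in datasets."""
    known_ids = {d["id"] for d in datasets}
    missing = []
    for pub in publications:
        for link in pub.get("related_dataset", []):
            if link["dataset_id"] not in known_ids:
                missing.append((pub["title"], link["dataset_id"]))
    return missing
```

Any pair returned points to a dataset that still needs an entry in `datasets.json`, or to a typo in the linkage.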