Background/Motivation
We're working on InVEST-DataHub integration, and we want to enable accurate, semantic searching of the DataHub from the Workbench. A search will be triggered from a specific model input, and the results should be a list of DataHub datasets that are suitable for that input. Attaching keywords from a controlled vocabulary to both InVEST inputs and DataHub datasets will facilitate this.
Description
spec.Input types will have an optional keywords attribute. It will be a list of strings or, more likely, a list of instances of a new Keyword type. The new types could look like this: each includes the keyword value as a string, along with other metadata that references the source of the keyword.
from pydantic import BaseModel  # assumed: BaseModel here is pydantic's


class GCMDKeyword(BaseModel):
    value: str      # the keyword itself
    uuid: str       # identifier of the keyword in the GCMD
    full_path: str  # position of the keyword in the GCMD hierarchy
    vocabulary: str = 'Global Change Master Directory (GCMD) Keywords'


class InvestKeyword(BaseModel):
    value: str
    vocabulary: str = 'InVEST Keywords'


# Examples:
PAWC = InvestKeyword(value='PLANT AVAILABLE WATER CONTENT')
PRECIPITATION = GCMDKeyword(
    value='PRECIPITATION',
    uuid='1532e590-a62d-46e3-8d03-2351bc48166a',
    full_path='EARTH SCIENCE > ATMOSPHERE > PRECIPITATION')

We want the keywords to be standardized, so when possible we will use keywords defined in the NASA Global Change Master Directory (GCMD) keyword vocabulary: https://gcmd.earthdata.nasa.gov/KeywordViewer
If there are no suitable matches there, we can define our own keyword, or reference another controlled vocabulary.
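To illustrate how the optional keywords attribute might sit on an input, here is a minimal sketch. It uses stdlib dataclasses in place of BaseModel so that it runs without dependencies, and the `Keyword` and `Input` classes, their fields other than those named above, and the example input ids are all hypothetical stand-ins, not the actual spec.Input definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Keyword:
    """Hypothetical stand-in for the keyword types: a value plus its source."""
    value: str
    vocabulary: str
    uuid: Optional[str] = None       # only for vocabularies that issue IDs
    full_path: Optional[str] = None  # only for hierarchical vocabularies


@dataclass
class Input:
    """Hypothetical, heavily simplified stand-in for a spec.Input type."""
    id: str
    name: str
    about: str = ''
    # Optional: inputs that have no keywords simply get an empty list.
    keywords: List[Keyword] = field(default_factory=list)


PRECIPITATION = Keyword(
    value='PRECIPITATION',
    vocabulary='Global Change Master Directory (GCMD) Keywords',
    uuid='1532e590-a62d-46e3-8d03-2351bc48166a',
    full_path='EARTH SCIENCE > ATMOSPHERE > PRECIPITATION')

# An input annotated with a keyword (the id and name are made up).
precip_input = Input(
    id='precipitation_path',
    name='precipitation',
    keywords=[PRECIPITATION])

# An input that declares no keywords is still valid.
workspace_input = Input(id='workspace_dir', name='workspace')
```

Keeping the attribute optional with an empty default would let keywords be added to inputs incrementally, without touching inputs that have no obvious match in any vocabulary.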
Benefits
In addition to enabling DataHub searches, using keywords from an established vocabulary like the GCMD could enable integration with other data catalogs or models that are annotated with that same vocabulary.
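As a rough sketch of what keyword-based matching could look like, the snippet below selects datasets that share at least one (vocabulary, value) pair with an input. Everything here is an assumption for illustration: the dataset names are invented, the catalog is an in-memory dict, and `keyword_key` and `matching_datasets` are hypothetical helpers. It does not reflect the DataHub's actual search API.

```python
from typing import Dict, List, Set, Tuple

KeywordKey = Tuple[str, str]  # (vocabulary, value)

GCMD = 'Global Change Master Directory (GCMD) Keywords'
INVEST = 'InVEST Keywords'


def keyword_key(vocabulary: str, value: str) -> KeywordKey:
    """Normalize a keyword so that case and stray whitespace do not matter."""
    return (vocabulary.strip(), value.strip().upper())


def matching_datasets(
        input_keywords: Set[KeywordKey],
        datasets: Dict[str, Set[KeywordKey]]) -> List[str]:
    """Return names of datasets sharing at least one keyword with the input."""
    return sorted(
        name for name, dataset_keywords in datasets.items()
        if input_keywords & dataset_keywords)


# Made-up catalog entries, each annotated with normalized keywords.
catalog = {
    'hypothetical-annual-precipitation': {
        keyword_key(GCMD, 'PRECIPITATION')},
    'hypothetical-soil-water': {
        keyword_key(INVEST, 'Plant Available Water Content')},
    'hypothetical-unannotated': set(),
}

precip_keywords = {keyword_key(GCMD, 'precipitation')}
results = matching_datasets(precip_keywords, catalog)
print(results)
```

Matching on the vocabulary together with the value means that two vocabularies using the same term would not be conflated, which matters if keywords from several sources are allowed.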
Mitigation of side effects
InVEST models pertain to several different scientific domains. While the GCMD is fairly comprehensive, and the community can request additions to it, there is no complete agreement across domains on a standard vocabulary. Our approach allows the inclusion of keywords from any vocabulary, along with a reference to their source. However, maintaining keywords from many vocabularies within InVEST could become difficult to manage.
Alternatives
The "id", "name" and "about" attributes of inputs already contain some text that could conceivably be used for text-based searches of the DataHub, but some basic attempts yielded poor search results. In general, the results were too broad, and many items were not a good match for the specific input.