Annotate invest inputs with keywords #2467

@davemfish

Background/Motivation

We're working on InVEST - DataHub Integration and we want to enable accurate, semantic searching of the DataHub from the Workbench. Search will be triggered from a specific model input, and results should be a list of datasets from the DataHub that would be suitable for that input. Having a controlled vocabulary of keywords attached to invest inputs and to DataHub datasets will facilitate this.

Description

spec.Input types will have an optional keywords attribute. It will be a list of strings or, more likely, a list of instances of a new Keyword type. The new types could look like this: each includes the keyword value as a string, plus other metadata that references the source of the keyword.

# Assumes BaseModel is pydantic's BaseModel.
from pydantic import BaseModel


class GCMDKeyword(BaseModel):
    value: str
    uuid: str
    full_path: str
    vocabulary: str = 'Global Change Master Directory (GCMD) Keywords'


class InvestKeyword(BaseModel):
    value: str
    vocabulary: str = 'InVEST Keywords'

# Examples:

PAWC = InvestKeyword(value='PLANT AVAILABLE WATER CONTENT')

PRECIPITATION = GCMDKeyword(
    value='PRECIPITATION',
    uuid='1532e590-a62d-46e3-8d03-2351bc48166a',
    full_path='EARTH SCIENCE > ATMOSPHERE > PRECIPITATION')
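
To illustrate how the optional keywords attribute might sit on an input, here is a minimal sketch. It uses stdlib dataclasses instead of BaseModel so that it runs without dependencies. `Keyword` and `RasterInput` are hypothetical stand-ins for the keyword types above and for a spec.Input type, and the ids and names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Keyword:
    """Hypothetical stand-in for the keyword types proposed above."""
    value: str
    vocabulary: str


@dataclass
class RasterInput:
    """Hypothetical stand-in for a spec.Input type; not the real class."""
    id: str
    name: str
    # Optional: an input without annotations simply has an empty list.
    keywords: list[Keyword] = field(default_factory=list)


PRECIPITATION = Keyword(
    value='PRECIPITATION',
    vocabulary='Global Change Master Directory (GCMD) Keywords')

# An annotated input and an unannotated one (illustrative ids only).
precip_input = RasterInput(
    id='precipitation_path',
    name='precipitation',
    keywords=[PRECIPITATION])
unannotated_input = RasterInput(id='workspace_dir', name='workspace')
```

Defaulting to an empty list keeps the attribute optional, so existing inputs would not need to change until someone annotates them.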

We want the keywords to be standardized, so when possible we will use keywords defined in NASA's Global Change Master Directory (GCMD) Keywords: https://gcmd.earthdata.nasa.gov/KeywordViewer

If there are no suitable matches there, we can define our own keyword, or reference another controlled vocabulary.
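
Because keywords may come from more than one vocabulary, a search could match on the (vocabulary, value) pair rather than on the value alone. The sketch below shows that idea under stated assumptions: the `Keyword` dataclass, the `matching_datasets` function, and the dataset titles are all hypothetical, and nothing here reflects the DataHub's actual search API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    """Hypothetical keyword; frozen so instances are hashable."""
    value: str
    vocabulary: str


def matching_datasets(input_keywords, datasets):
    """Return titles of datasets that share at least one keyword with the input.

    `datasets` maps a dataset title to a set of Keyword instances.
    Equality compares both value and vocabulary.
    """
    wanted = set(input_keywords)
    return [title for title, kws in datasets.items() if wanted & kws]


GCMD = 'Global Change Master Directory (GCMD) Keywords'
INVEST = 'InVEST Keywords'

# Made-up catalog entries, for illustration only.
datasets = {
    'dataset_a': {Keyword('PRECIPITATION', GCMD)},
    # Same value, different vocabulary: treated as a different keyword.
    'dataset_b': {Keyword('PRECIPITATION', INVEST)},
    'dataset_c': {Keyword('PLANT AVAILABLE WATER CONTENT', INVEST)},
}

result = matching_datasets([Keyword('PRECIPITATION', GCMD)], datasets)
```

Whether same-valued keywords from different vocabularies should be treated as equivalent is a design choice this sketch does not settle; it takes the strict option.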

Benefits

In addition to enabling DataHub searches, using keywords from an established vocabulary like the GCMD could enable integration with other data catalogs or models that are annotated with that same vocabulary.

Mitigation of side effects

InVEST models pertain to several different scientific domains. While the GCMD is somewhat comprehensive, and the community can request additions to it, there is not complete agreement across domains about a standard vocabulary. Our approach allows inclusion of keywords from any vocabulary, along with a reference to their source, though maintaining all of that within InVEST might get out of hand.

Alternatives

The "id", "name" and "about" attributes of inputs already have some text that could conceivably be used to do text-based searches on the DataHub, but some basic attempts yielded poor search results. In general, results were too broad and many items were not a good match for the specific input.

Labels

DataHub Integration: Pertaining to the Workbench-DataHub integration project.
