diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index d2dd034d1..f27608ffc 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,112 +1,107 @@ -# Architecture of SHARE/Trove -> NOTE: this document requires update (big ol' TODO) - +# Architecture of SHARE/trove This document is a starting point and reference to familiarize yourself with this codebase. ## Bird's eye view -In short, SHARE/Trove takes metadata records (in any supported input format), -ingests them, and makes them available in any supported output format. -``` - ┌───────────────────────────────────────────┐ - │ Ingest │ - │ ┌──────┐ │ - │ ┌─────────────────────────┐ ┌──►Format├─┼────┐ - │ │ Normalize │ │ └──────┘ │ │ - │ │ │ │ │ ▼ -┌───────┐ │ │ ┌─────────┐ ┌────────┐ │ │ ┌──────┐ │ save as -│Harvest├─┬─┼─┼─►Transform├──►Regulate├─┼─┬─┼──►Format├─┼─┬─►FormattedMetadataRecord -└───────┘ │ │ │ └─────────┘ └────────┘ │ │ │ └──────┘ │ │ - │ │ │ │ │ . │ │ ┌───────┐ - │ │ └─────────────────────────┘ │ . │ └──►Indexer│ - │ │ │ . │ └───────┘ - │ └─────────────────────────────┼─────────────┘ some formats also - │ │ indexed separately - ▼ ▼ - save as save as - RawDatum NormalizedData +In short, SHARE/trove holds metadata records that describe things and makes those records available for searching, browsing, and subscribing. + + + + +## Parts +a look at the tangles of communication between different parts of the system: + +```mermaid +graph LR; + subgraph shtrove; + subgraph web[api/web server]; + ingest; + search; + browse; + rss; + atom; + oaipmh; + end; + worker["background worker (celery)"]; + indexer["indexer daemon"]; + rabbitmq["task queue (rabbitmq)"]; + postgres["database (postgres)"]; + elasticsearch; + web---rabbitmq; + web---postgres; + web---elasticsearch; + worker---rabbitmq; + worker---postgres; + worker---elasticsearch; + indexer---rabbitmq; + indexer---postgres; + indexer---elasticsearch; + end; + source["metadata source (e.g. 
osf.io backend)"]; + user["web user, either by browsing directly or via web app (like osf.io)"]; + subscribers["feed subscription tools"]; + source-->ingest; + user-->search; + user-->browse; + subscribers-->rss; + subscribers-->atom; + subscribers-->oaipmh; ``` ## Code map A brief look at important areas of code as they happen to exist now. -### Static configuration - -`share/schema/` describes the "normalized" metadata schema/format that all -metadata records are converted into when ingested. - -`share/sources/` describes a starting set of metadata sources that the system -could harvest metadata from -- these will be put in the database and can be -updated or added to over time. - -`project/settings.py` describes system-level settings which can be set by -environment variables (and their default values), as well as settings -which cannot. - -`share/models/` describes the data layer using the [Django](https://www.djangoproject.com/) ORM. - -`share/subjects.yaml` describes the "central taxonomy" of subjects allowed -in `Subject.name` fields of `NormalizedData`. - -### Harvest and ingest - -`share/harvest/` and `share/harvesters/` describe how metadata records -are pulled from other metadata repositories. - -`share/transform/` and `share/transformers/` describe how raw data (possibly -in any format) are transformed to the "normalized" schema. 
+- `trove`: django app for rdf-based apis + - `trove.digestive_tract`: most of what happens after ingestion + - stores records and identifiers in the database + - initiates indexing + - `trove.extract`: parsing ingested metadata records into resource descriptions + - `trove.derive`: from a given resource description, create special non-rdf serializations + - `trove.render`: from an api response modeled as rdf graph, render the requested mediatype + - `trove.models`: database models for identifiers and resource descriptions + - `trove.trovesearch`: builds rdf-graph responses for trove search apis (using `IndexStrategy` implementations from `share.search`) + - `trove.vocab`: identifies and describes concepts used elsewhere + - `trove.vocab.trove`: describes types, properties, and api paths in the trove api + - `trove.vocab.osfmap`: describes metadata from osf.io (currently the only metadata ingested) + - `trove.openapi`: generate openapi json for the trove api from thesaurus in `trove.vocab.trove` +- `share`: django app with search indexes and remnants of sharev2 + - `share.models`: database models for external sources, users, and other system book-keeping + - `share.oaipmh`: provide data via [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html) + - `share.search`: all interaction with elasticsearch + - `share.search.index_strategy`: abstract base class `IndexStrategy` with multiple implementations, for different approaches to indexing the same data + - `share.search.daemon`: the "indexer daemon", an optimized background worker for batch-processing updates and sending to all active index strategies + - `share.search.index_messenger`: for sending messages to the indexer daemon +- `api`: django app with remnants of the legacy sharev2 api + - `api.views.feeds`: allows custom RSS and Atom feeds + - otherwise, subject to possible deprecation +- `osf_oauth2_adapter`: django app for login via osf.io +- `project`: the actual django project + - default 
settings at `project.settings` + - pulls together code from other directories implemented as django apps (`share`, `trove`, `api`, and `osf_oauth2_adapter`) -`share/regulate/` describes rules which are applied to every normalized datum, -regardless where or what format it originally come from. -`share/metadata_formats/` describes how a normalized datum can be formatted -into any supported output format. - -`share/tasks/` runs the harvest/ingest pipeline and stores each task's status -(including debugging info, if errored) as a `HarvestJob` or `IngestJob`. - -### Outward-facing views - -`share/search/` describes how the search indexes are structured, managed, and -updated when new metadata records are introduced -- this provides a view for -discovering items based on whatever search criteria. - -`share/oaipmh/` describes the [OAI-PMH](https://www.openarchives.org/OAI/openarchivesprotocol.html) -view for harvesting metadata from SHARE/Trove in bulk. - -`api/` describes a mostly REST-ful API that's useful for inspecting records for -a specific item of interest. - -### Internals - -`share/admin/` is a Django-app for administrative access to the SHARE database -and pipeline logs - -`osf_oauth2_adapter/` is a Django app to support logging in to SHARE via OSF +## Cross-cutting concerns -### Testing +### Resource descriptions -`tests/` are tests. +Uses the [resource description framework](https://www.w3.org/TR/rdf11-primer/#section-Introduction): +- the content of each ingested metadata record is an rdf graph focused on a specific resource +- all api responses from `trove` views are (experimentally) modeled as rdf graphs, which may be rendered a variety of ways -## Cross-cutting concerns +### Identifiers -### Immutable metadata +Whenever feasible, use full URI strings to identify resources, concepts, types, and properties that may be exposed outwardly. 
-Metadata records at all stages of the pipeline (`RawDatum`, `NormalizedData`, -`FormattedMetadataRecord`) should be considered immutable -- any updates -result in a new record being created, not an old record being altered. +Prefer using open, standard, well-defined namespaces wherever possible ([DCAT](https://www.w3.org/TR/vocab-dcat-3/) is a good place to start; see `trove.vocab.namespaces` for others already in use). When app-specific concepts must be defined, use the `TROVE` namespace (`https://share.osf.io/vocab/2023/trove/`). -Multiple records which describe the same item/object are grouped by a -"source-unique identifier" or "suid" -- essentially a two-tuple -`(source, identifier)` that uniquely and persistently identifies an item in -the source repository. In most outward-facing views, default to showing only -the most recent record for each suid. +A notable exception (non-URI identifier) is the "source-unique identifier" or "suid" -- essentially a two-tuple `(source, identifier)` that uniquely and persistently identifies a metadata record in a source repository. This `identifier` may be any string value, provided by the external source. ### Conventions (an incomplete list) -- functions prefixed `pls_` ("please") are a request for something to happen +- local variables prefixed with underscore (to consistently distinguish between internal-only names and those imported/built-in) +- prefer full type annotations in python code, wherever reasonably feasible ## Why this? inspired by [this writeup](https://matklad.github.io/2021/02/06/ARCHITECTURE.md.html) diff --git a/CHANGELOG.md b/CHANGELOG.md index d8af0c86a..c92dbfcf6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,10 @@ # Change Log +# [25.5.1] - 2025-08-21 +- improve error handling in celery task-result backend +- use logging config in celery worker +- improve code docs (README.md et al.) 
+ # [25.5.0] - 2025-07-15 - use python 3.13 - use `poetry` to manage dependencies diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index d14287ddb..ca8dcf691 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,7 +1,18 @@ # CONTRIBUTING -TODO: how do we want to guide community contributors? +> note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools that should be more accessible to community contribution -For now, if you're interested in contributing to SHARE/Trove, feel free to +For now, if you're interested in contributing to SHARE/trove, feel free to [open an issue on github](https://github.com/CenterForOpenScience/SHARE/issues) and start a conversation. + +## Required checks + +All changes must pass the following checks with no errors: +- linting: `python -m flake8` +- static type-checking (on `trove/` code only, for now): `python -m mypy trove` +- tests: `python -m pytest -x tests/` + - note: some tests require other services running -- if [using the provided docker-compose.yml](./how-to/run-locally.md), recommend running them in the background (bringing up `worker` also brings up the rest: `docker compose up -d worker`) and executing tests from within one of the python containers (`indexer`, `worker`, or `web`): + `docker compose exec indexer python -m pytest -x tests/` + +All new changes should also avoid decreasing test coverage, when reasonably possible (currently checked on github pull requests). diff --git a/README.md b/README.md index 27a21f903..201adfc2b 100644 --- a/README.md +++ b/README.md @@ -1,33 +1,17 @@ -# SHARE/Trove +# SHARE/trove (aka SHARtrove, shtrove) -SHARE is creating a free, open dataset of research (meta)data. +> share (verb): to have or use in common. 
-> **Note**: SHARE’s open API tools and services help bring together scholarship distributed across research ecosystems for the purpose of greater discoverability. However, SHARE does not guarantee a complete aggregation of searched outputs. For this reason, SHARE results should not be used for methodological analyses, such as systematic reviews. +> trove (noun): a store of valuable or delightful things. -[](https://coveralls.io/github/CenterForOpenScience/SHARE?branch=develop) +SHARE/trove (aka SHARtrove, shtrove) is a service meant to store (meta)data you wish to keep and offer openly. -## Documentation +note: this codebase is currently (and historically) rather entangled with [osf.io](https://osf.io), which has its shtrove at https://share.osf.io -- stay tuned for more-reusable open-source libraries and tools for working with (meta)data -### What is this? -see [WHAT-IS-THIS-EVEN.md](./WHAT-IS-THIS-EVEN.md) +see [ARCHITECTURE.md](./ARCHITECTURE.md) for help navigating this codebase -### How can I use it? -see [how-to/use-the-api.md](./how-to/use-the-api.md) +see [CONTRIBUTING.md](./CONTRIBUTING.md) for info about contributing changes -### How do I navigate this codebase? -see [ARCHITECTURE.md](./ARCHITECTURE.md) - -### How do I run a copy locally? 
-see [how-to/run-locally.md](./how-to/run-locally.md) - - -## Running Tests - -### Unit test suite - - py.test - -### BDD Suite - - behave +see [how-to/use-the-api.md](./how-to/use-the-api.md) for help using the api to add and access (meta)data +see [how-to/run-locally.md](./how-to/run-locally.md) for help running a shtrove instance for local development diff --git a/TODO.md b/TODO.md new file mode 100644 index 000000000..4b9d41b16 --- /dev/null +++ b/TODO.md @@ -0,0 +1,86 @@ +# TODO: +ways to better this mess + +## better shtrove api experience + +- better web-browsing experience + - when `Accept` header accepts html, use html regardless of query-params + - when query param `acceptMediatype` requests another mediatype, display on page in copy/pastable way + - exception: when given `withFileName`, download without html wrapping + - exception: `/trove/browse` should still give hypertext with clickable links + - include more explanatory docs (and better fill out those explanations) + - more helpful (less erratic) visual design + - in each html rendering of an api response, include a `