Python code to search the MDR using a keyword (or complex combination) over all records and occurrences and return a list of URLs. Written for the TCE project to identify water quality datasets.
- LICENSE.md - Open Government License v3
- mdr-searcher.py - Python script to search the MDR using a keyword and return all records containing instances of the keyword.
- mdr-searcher-batch.py - Python script to search the MDR using an array of keywords and return all record titles and links in a spreadsheet.
- mdr-searcher-batch-extents.py - Batch MDR scraper that:
  - searches the MDR using an array of keywords,
  - returns all record titles and links,
  - extracts basic temporal and spatial extents plus data-layer classifications into combined Excel/CSV/JSON outputs,
  - optionally fetches record HTML (disable via --no-html-fetch to speed up large runs).
- app.py - Plotly dashboard app to inspect returned MDR search results.
- requirements.txt - Python libraries required to run the Plotly dashboard app.
- .github/workflows/pages.yaml - GitHub Actions workflow file.
- site/index.html - Static landing page with instructions for local install and run.
- changelog.md - Changelog of Python code updates.
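To give a feel for what "basic temporal extents" means, here is a minimal, illustrative sketch of pulling a date range out of a record's text. The function name, the assumption that ISO-style YYYY-MM-DD dates appear in record text, and the sample summary string are all illustrative; the actual script's parsing rules may differ.

```python
import re

def extract_temporal_extent(text):
    """Pull a (start, end) date pair out of free text (illustrative only).

    Assumes ISO-style YYYY-MM-DD dates appear somewhere in the record text;
    lexicographic min/max on ISO dates is also chronological min/max.
    """
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    if not dates:
        return None
    return min(dates), max(dates)

# Hypothetical record summary containing a survey date range
summary = "Survey data collected from 2015-03-01 to 2018-11-30 in the North Sea."
print(extract_temporal_extent(summary))  # ('2015-03-01', '2018-11-30')
```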
The first step is to clone the latest MDR_searcher code and change into the checked-out directory:
$ git clone https://github.com/CefasRepRes/MDR_searcher.git
$ cd MDR_searcher
$ python mdr-searcher.py [options]
$ python mdr-searcher-batch.py [options]
$ python mdr-searcher-batch-extents.py [options]
The Python code has been tested with Python v3.12.3.
Interactive Plotly Dash app for exploring MDR metadata JSON files.
- Clone this repository.
- Create and activate a virtual environment (optional but recommended):
$ python -m venv .venv
$ source .venv/bin/activate   # on Windows: .venv\Scripts\activate
- Install dependencies:
$ pip install -r requirements.txt
- Run the MDR searcher:
$ python mdr-searcher-batch-extents.py --queries-file queries.txt --paginate-method url --no-html-fetch --max-pages 0 --out mdr_metadata
- Run the Dash app with a metadata JSON file:
$ python app.py --json mdr_metadata_YYYYMMDD_HHMM.json --port 8050
- Then open your browser at: http://127.0.0.1:8050/
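The queries file passed via --queries-file holds one query per line; quoted strings are treated as whole phrases and AND/OR/NOT can combine terms. The example contents below are illustrative (these terms are assumptions, not a file shipped with the repo):

```text
water quality
"dissolved oxygen"
eutrophication AND plankton
```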
Here are handy run-command “recipes” for the batch run script:
- (default) Multi-keyword batch run, visible (headed), URL pagination, unlimited pages:
  $ python mdr-searcher-batch-extents.py --queries-file queries.txt --paginate-method url --no-html-fetch --max-pages 0 --out mdr_metadata
- Filter titles by the whole phrase (case-insensitive):
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match phrase --combined-out OUT\mdr_links_combined_phrase.xlsx --max-pages 20
- Keep rows where any term (or quoted phrase) from the query appears in the title:
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match any --combined-out OUT\mdr_links_combined_any.xlsx --max-pages 20
- Keep rows where all terms (and quoted phrases) appear in the title:
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match all --combined-out OUT\mdr_links_combined_all.xlsx --max-pages 20
- Multiple queries inline (no file):
  $ python mdr-searcher-batch.py --query "water quality" --query "eutrophication" --query "plankton" --headed --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
- Write per-keyword files to a folder and de-dupe links in the combined file:
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --out-dir ./mdr_out --combined-out OUT\mdr_links_combined.xlsx --dedupe-links --max-pages 20
- Use URL-based pagination instead of clicking "Next":
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --paginate-method url --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
- Single keyword + title filter + URL pagination (example combo):
  $ python mdr-searcher-batch.py --query "water quality AND eutrophication" --headed --title-match any --paginate-method url --combined-out OUT\mdr_links_combined_any.xlsx --max-pages 25
- Headless (only if your login/session already works headless):
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
Notes:
- --headed is recommended so you can complete Microsoft sign-in once; the session is reused across all pages and keywords.
- Per-keyword files are saved as slugs of the query (e.g., water-quality.xlsx) in --out-dir (default).
- Title filtering happens after fetching titles via the API:
  - --title-match phrase → the raw query substring must appear in the title
  - --title-match any/all → quoted phrases are treated as single terms; AND/OR/NOT are ignored as boolean words
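The title-filtering rules above can be pictured with a short Python sketch. This is a simplified re-implementation for illustration, not the script's actual code; the function name and exact tokenisation are assumptions.

```python
import re

def title_matches(query, title, mode="phrase"):
    """Check a record title against a query, mimicking --title-match.

    phrase: the raw query must appear as a substring (case-insensitive).
    any/all: quoted phrases count as single terms; AND/OR/NOT are dropped.
    Simplified sketch only; the real script's logic may differ.
    """
    t = title.lower()
    if mode == "phrase":
        return query.lower() in t
    # Pull out quoted phrases first, then bare words
    pairs = re.findall(r'"([^"]+)"|(\S+)', query)
    terms = [(quoted or bare).lower() for quoted, bare in pairs]
    terms = [x for x in terms if x not in ("and", "or", "not")]
    hits = [term in t for term in terms]
    return any(hits) if mode == "any" else all(hits)

print(title_matches('"water quality" eutrophication',
                    "Coastal water quality survey", mode="any"))  # True
print(title_matches('"water quality" eutrophication',
                    "Coastal water quality survey", mode="all"))  # False
```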
The code is distributed under the terms and conditions of the Open Government License.