Python code to search the MDR using a keyword (or complex combination) over all records and occurrences and return a list of URLs. Written for the TCE project to identify water quality datasets.
- LICENSE.md - Open Government License v3
- mdr-searcher.py - Python script to search the MDR using a keyword and return all records containing instances of the keyword.
- mdr-searcher-batch.py - Python script to search the MDR using an array of keywords and return all record titles and links in a spreadsheet.
- mdr-searcher-batch-extents.py - Batch MDR scraper that:
  - searches the MDR using an array of keywords,
  - returns all record titles and links,
  - extracts basic temporal and spatial extents plus data-layer classifications into combined Excel/CSV/JSON outputs,
  - optionally fetches record HTML (disable via --no-html-fetch to speed up large runs).
- app.py - Plotly dashboard app to inspect returned MDR search results.
- requirements.txt - Python libraries required to run the Plotly dashboard app.
- .github/workflows/pages.yaml - GitHub Actions workflow file.
- site/index.html - Static landing page with instructions for local install and run.
- changelog.md - Changelog of Python code updates.
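To give a feel for what "basic temporal extents" means, here is a minimal, illustrative sketch of pulling a date range out of a record's text. The function name, the assumption that ISO-style YYYY-MM-DD dates appear in record text, and the sample summary string are all illustrative; the actual script's parsing rules may differ.

```python
import re

def extract_temporal_extent(text):
    """Pull a (start, end) date pair out of free text (illustrative only).

    Assumes ISO-style YYYY-MM-DD dates appear somewhere in the record text;
    lexicographic min/max on ISO dates is also chronological min/max.
    """
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    if not dates:
        return None
    return min(dates), max(dates)

# Hypothetical record summary containing a survey date range
summary = "Survey data collected from 2015-03-01 to 2018-11-30 in the North Sea."
print(extract_temporal_extent(summary))  # ('2015-03-01', '2018-11-30')
```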
The first step is to clone the latest MDR_searcher code and change into the checked-out directory:
$ git clone https://github.com/CefasRepRes/MDR_searcher.git
$ cd MDR_searcher
$ python mdr-searcher.py [options]
$ python mdr-searcher-batch.py [options]
$ python mdr-searcher-batch-extents.py [options]
The Python code has been tested with Python v3.12.3.
Interactive Plotly Dash app for exploring MDR metadata JSON files.
- Clone this repository.
- Create and activate a virtual environment (optional but recommended):
$ python -m venv .venv
$ source .venv/bin/activate   # on Windows: .venv\Scripts\activate
- Install dependencies:
$ pip install -r requirements.txt
- Run the MDR searcher:
$ python mdr-searcher-batch-extents.py --queries-file queries.txt --paginate-method url --no-html-fetch --max-pages 0 --out mdr_metadata
- Run the Dash app with a metadata JSON file:
$ python app.py --json mdr_metadata_YYYYMMDD_HHMM.json --port 8050
- Then open your browser at: http://127.0.0.1:8050/
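The queries file passed via --queries-file holds one query per line; quoted strings are treated as whole phrases and AND/OR/NOT can combine terms. The example contents below are illustrative (these terms are assumptions, not a file shipped with the repo):

```text
water quality
"dissolved oxygen"
eutrophication AND plankton
```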
Here are handy run-command “recipes” for the batch run script:
- (default) Multi-keyword batch run, visible (headed), URL pagination, unlimited pages:
  $ python mdr-searcher-batch-extents.py --queries-file queries.txt --paginate-method url --no-html-fetch --max-pages 0 --out mdr_metadata
- Filter titles by the whole phrase (case-insensitive):
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match phrase --combined-out OUT\mdr_links_combined_phrase.xlsx --max-pages 20
- Keep rows where any term (or quoted phrase) from the query appears in the title:
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match any --combined-out OUT\mdr_links_combined_any.xlsx --max-pages 20
- Keep rows where all terms (and quoted phrases) appear in the title:
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match all --combined-out OUT\mdr_links_combined_all.xlsx --max-pages 20
- Multiple queries inline (no file):
  $ python mdr-searcher-batch.py --query "water quality" --query "eutrophication" --query "plankton" --headed --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
- Write per-keyword files to a folder and de-dupe links in the combined file:
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --out-dir ./mdr_out --combined-out OUT\mdr_links_combined.xlsx --dedupe-links --max-pages 20
- Use URL-based pagination instead of clicking "Next":
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --paginate-method url --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
- Single keyword + title filter + URL pagination (example combo):
  $ python mdr-searcher-batch.py --query "water quality AND eutrophication" --headed --title-match any --paginate-method url --combined-out OUT\mdr_links_combined_any.xlsx --max-pages 25
- Headless (only if your login/session already works headless):
  $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
Notes:
- --headed is recommended so you can complete Microsoft sign-in once; the session is reused across all pages and keywords.
- Per-keyword files are saved as slugs of the query (e.g., water-quality.xlsx) in --out-dir (default).
- Title filtering happens after fetching titles via the API:
  - --title-match phrase → the raw query substring must appear in the title
  - --title-match any/all → quoted phrases are treated as single terms; AND/OR/NOT are ignored as boolean words
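The title-filtering rules above can be pictured with a short Python sketch. This is a simplified re-implementation for illustration, not the script's actual code; the function name and exact tokenisation are assumptions.

```python
import re

def title_matches(query, title, mode="phrase"):
    """Check a record title against a query, mimicking --title-match.

    phrase: the raw query must appear as a substring (case-insensitive).
    any/all: quoted phrases count as single terms; AND/OR/NOT are dropped.
    Simplified sketch only; the real script's logic may differ.
    """
    t = title.lower()
    if mode == "phrase":
        return query.lower() in t
    # Pull out quoted phrases first, then bare words
    pairs = re.findall(r'"([^"]+)"|(\S+)', query)
    terms = [(quoted or bare).lower() for quoted, bare in pairs]
    terms = [x for x in terms if x not in ("and", "or", "not")]
    hits = [term in t for term in terms]
    return any(hits) if mode == "any" else all(hits)

print(title_matches('"water quality" eutrophication',
                    "Coastal water quality survey", mode="any"))  # True
print(title_matches('"water quality" eutrophication',
                    "Coastal water quality survey", mode="all"))  # False
```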
The code is distributed under the terms and conditions of the Open Government License.