MDR_searcher

Python code to search the MDR using a keyword (or complex combination) over all records and occurrences and return a list of URLs. Written for the TCE project to identify water quality datasets.

Contents

  • LICENSE.md - Open Government License v3
  • mdr-searcher.py - Python script to search the MDR using a keyword and return all records containing instances of the keyword.
  • mdr-searcher-batch.py - Python script to search the MDR using an array of keywords and return all record titles and links in a spreadsheet.
  • mdr-searcher-batch-extents.py - Batch MDR scraper that:
    • searches the MDR using an array of keywords,
    • returns all record titles and links,
    • extracts basic temporal and spatial extents plus data-layer classifications into combined Excel/CSV/JSON outputs,
    • optionally performs an HTML fetch (disable it with --no-html-fetch to speed up large runs).
  • app.py - Plotly dashboard app to inspect returned MDR search results.
  • requirements.txt - Python libraries required to run the Plotly dashboard app.
  • /.github/workflows/pages.yaml - GitHub Actions YAML file.
  • /site/index.html - Static landing page with instructions for local install and run.
  • changelog.md - Changelog of Python code updates.

Setup / run

The first step is to clone the latest MDR_searcher code and step into the checkout directory:

$ git clone https://github.com/CefasRepRes/MDR_searcher.git
$ cd MDR_searcher

$ python mdr-searcher.py [options]
$ python mdr-searcher-batch.py [options]
$ python mdr-searcher-batch-extents.py [options]

The Python code has been tested using Python v3.12.3.

App

Interactive Plotly Dash app for exploring MDR metadata JSON files.

Installation

  1. Clone this repository

  2. Create and activate a virtual environment (optional but recommended):

    $ python -m venv .venv
    $ source .venv/bin/activate # on Windows: .venv\Scripts\activate

  3. Install dependencies:

    $ pip install -r requirements.txt

  4. Run the MDR searcher:

    $ python mdr-searcher-batch-extents.py --queries-file queries.txt --paginate-method url --no-html-fetch --max-pages 0 --out mdr_metadata

  5. Run the Dash app with a metadata JSON file:

    $ python app.py --json mdr_metadata_YYYYMMDD_HHMM.json --port 8050

  6. Then open your browser at: http://127.0.0.1:8050/
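Step 5 expects the name of a timestamped JSON file produced in step 4. As a minimal sketch, assuming the output files follow the `mdr_metadata_YYYYMMDD_HHMM.json` pattern shown above and sit in one folder, the newest file can be picked by sorting the names, because that timestamp format sorts chronologically as text. The helper name `latest_metadata_json` is hypothetical and not part of the repository.

```python
from pathlib import Path


def latest_metadata_json(folder=".", prefix="mdr_metadata"):
    """Return the newest '<prefix>_YYYYMMDD_HHMM.json' file in folder, or None.

    Hypothetical helper: it assumes the timestamped naming shown in step 5,
    so that sorting the file names also sorts them by date and time.
    """
    candidates = sorted(Path(folder).glob(f"{prefix}_*.json"))
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    newest = latest_metadata_json()
    if newest is None:
        print("No metadata JSON found; run the MDR searcher first (step 4).")
    else:
        # Print the command from step 5 with the file name filled in.
        print(f"python app.py --json {newest} --port 8050")
```

Running the sketch from the checkout directory prints the step 5 command with the most recent file name substituted.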

Command line options

Here are handy run-command “recipes” for the batch run script:

  1. (default) Multi-keyword batch run, visible (headed), full URL query (unlimited UI pagination):

    $ python mdr-searcher-batch-extents.py --queries-file queries.txt --paginate-method url --no-html-fetch --max-pages 0 --out mdr_metadata

  2. Filter titles by the whole phrase (case-insensitive):

    $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match phrase --combined-out OUT\mdr_links_combined_phrase.xlsx --max-pages 20

  3. Keep rows where any term (or quoted phrase) from the query appears in the title:

    $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match any --combined-out OUT\mdr_links_combined_any.xlsx --max-pages 20

  4. Keep rows where all terms (and quoted phrases) appear in the title:

    $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --title-match all --combined-out OUT\mdr_links_combined_all.xlsx --max-pages 20

  5. Multiple queries inline (no file):

    $ python mdr-searcher-batch.py --query "water quality" --query "eutrophication" --query "plankton" --headed --combined-out OUT\mdr_links_combined.xlsx --max-pages 20

  6. Write per-keyword files to a folder and de-dupe links in the combined file:

    $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --out-dir ./mdr_out --combined-out OUT\mdr_links_combined.xlsx --dedupe-links --max-pages 20

  7. Use URL-based pagination instead of clicking “Next”:

    $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --headed --paginate-method url --combined-out OUT\mdr_links_combined.xlsx --max-pages 20

  8. Single keyword + title filter + URL pagination (example combo):

    $ python mdr-searcher-batch.py --query "water quality AND eutrophication" --headed --title-match any --paginate-method url --combined-out OUT\mdr_links_combined_any.xlsx --max-pages 25

  9. Headless (only if your login/session already works headless):

    $ python mdr-searcher-batch.py --queries-file DATA\queries.txt --combined-out OUT\mdr_links_combined.xlsx --max-pages 20
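Recipe 6 writes one file per keyword and de-duplicates links in the combined file. The sketch below illustrates both ideas under stated assumptions: the slug rule (lower-case, non-alphanumeric runs replaced by hyphens) is a guess chosen to reproduce the `water-quality.xlsx` example in the notes, and the row layout with `title` and `link` keys is hypothetical. The script's actual rules may differ.

```python
import re


def query_slug(query):
    """Turn a query into a file-name-safe slug (assumed rule, not the script's)."""
    slug = re.sub(r"[^a-z0-9]+", "-", query.lower()).strip("-")
    return slug or "query"  # fall back if the query has no usable characters


def per_keyword_filename(query, ext=".xlsx"):
    """Build the per-keyword file name, e.g. 'water-quality.xlsx'."""
    return query_slug(query) + ext


def dedupe_by_link(rows):
    """Keep the first row for each link, preserving the original order.

    Each row is assumed to be a dict with a 'link' key (hypothetical layout).
    """
    seen = set()
    unique = []
    for row in rows:
        if row["link"] not in seen:
            seen.add(row["link"])
            unique.append(row)
    return unique


if __name__ == "__main__":
    print(per_keyword_filename("water quality"))
    rows = [
        {"title": "Record A", "link": "https://example.org/a"},
        {"title": "Record A (again)", "link": "https://example.org/a"},
        {"title": "Record B", "link": "https://example.org/b"},
    ]
    print(len(dedupe_by_link(rows)))
```

Keeping the first occurrence means a record found by several keywords stays attributed to the keyword that returned it first.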

Notes:

  • --headed is recommended so you can complete Microsoft sign-in once; the session is reused across all pages and keywords.
  • Per-keyword files are saved as slugs of the query (e.g., water-quality.xlsx) in --out-dir (default).
  • Title filtering happens after fetching titles via the API:
    • --title-match phrase → the raw query substring must appear in the title.
    • --title-match any/all → quoted phrases are treated as single terms; AND/OR/NOT are ignored as boolean words.
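The three --title-match modes can be illustrated with a short sketch. It is an approximation written from the notes above, not the script's own code: the function names are hypothetical, matching is assumed to be a case-insensitive substring test, and AND/OR/NOT are dropped only when they appear outside quotes.

```python
import re

BOOLEAN_WORDS = {"and", "or", "not"}


def query_terms(query):
    """Split a query into lower-case terms, keeping quoted phrases whole."""
    terms = []
    # Each match is either a "quoted phrase" (first group) or a bare word (second group).
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        term = (phrase or word).strip().lower()
        # Bare AND/OR/NOT are skipped; inside quotes they stay part of the phrase.
        if not term or (word and term in BOOLEAN_WORDS):
            continue
        terms.append(term)
    return terms


def title_matches(title, query, mode):
    """Return True if the title passes the given mode: 'phrase', 'any' or 'all'."""
    title_lc = title.lower()
    if mode == "phrase":
        # The raw query must appear in the title as one substring.
        return query.strip().lower() in title_lc
    terms = query_terms(query)
    if mode == "any":
        return any(term in title_lc for term in terms)
    if mode == "all":
        return all(term in title_lc for term in terms)
    raise ValueError(f"unknown title-match mode: {mode!r}")


if __name__ == "__main__":
    title = "Quality of coastal water"
    for mode in ("phrase", "any", "all"):
        print(mode, title_matches(title, "water quality", mode))
```

In this sketch the title "Quality of coastal water" fails the phrase mode for the query "water quality" because the words are not adjacent, but passes both any and all because each word occurs somewhere in the title.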

License

The code is distributed under the terms and conditions of the Open Government License v3 (see LICENSE.md).

Contact information
