This repository provides a CLI tool to scrape posts from Medium publications and custom domains (for example, engineering blogs). It follows Clean Architecture and provides:
- Intelligent discovery (auto-discovery, known IDs, fallbacks)
- Support for custom domains and usernames
- A rich CLI (progress indicators and formatted output via Rich)
- YAML-based configuration for sources and bulk collections
This README is written to be fully reproducible: installation, configuration, commands, testing and troubleshooting are covered below.
Table of contents
- Overview
- Quick start
- Configuration (medium_sources.yaml)
- CLI usage and examples
- How it works (high-level)
- Tests
- Troubleshooting
- Project layout
- Contributing and license
- Entry point: `main.py` (calls `src.presentation.cli.cli()`)
- Config file: `medium_sources.yaml` (expected in repo root)
- Default output folder: `outputs/`
- Python: 3.9+ (see `pyproject.toml`)
This tool resolves publication definitions, discovers post IDs (auto-discovery or fallbacks), fetches post details via adapters, and presents results in table/JSON/IDs formats.
- Create and activate a virtual environment (Linux/macOS):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install the package in editable mode (recommended):

  ```bash
  pip install --upgrade pip
  pip install -e .
  ```

- Verify the CLI is available and view help:

  ```bash
  python main.py --help
  ```

Notes:

- Main dependencies are declared in `pyproject.toml` (httpx, click, rich, pyyaml, pytest, requests, etc.).
- Installing in editable mode lets you make code changes without reinstalling.
The file `medium_sources.yaml` configures named sources and bulk collections. The example included in the repo contains many predefined keys (netflix, nytimes, airbnb, etc.).
Minimal example structure:

```yaml
sources:
  netflix:
    type: publication
    name: netflix
    description: "Netflix Technology Blog"
    auto_discover: true
    custom_domain: false

defaults:
  limit: 50
  skip_session: true
  format: json
  output_dir: "outputs"

bulk_collections:
  tech_giants:
    description: "Major tech blogs"
    sources: [netflix, airbnb]
```

Key notes:

- `type`: `publication` or `username`.
- `name`: publication name, domain, or `@username`. If `type` is `username` and `name` lacks `@`, the code adds it.
- `custom_domain`: set to `true` for domains like `open.nytimes.com`.
Use `python main.py --list-sources` to list keys and descriptions from the YAML file.
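
For context, the sketch below shows how a config like this could be loaded and normalized with PyYAML. It is illustrative only: the repo's actual loader is `SourceConfigManager` in `src/infrastructure/config/source_manager.py`, and its function names and return types may differ.

```python
# Illustrative sketch only -- the real loader is SourceConfigManager in
# src/infrastructure/config/source_manager.py and may behave differently.
from pathlib import Path
import yaml

def load_sources(path: str = "medium_sources.yaml") -> dict:
    """Load the YAML config and apply the '@username' normalization described above."""
    config = yaml.safe_load(Path(path).read_text(encoding="utf-8")) or {}
    for key, entry in config.get("sources", {}).items():
        # If a source is a username but the name lacks '@', prepend it.
        if entry.get("type") == "username" and not entry.get("name", "").startswith("@"):
            entry["name"] = "@" + entry.get("name", "")
    return config

if __name__ == "__main__":
    cfg = load_sources()
    for key, entry in cfg.get("sources", {}).items():
        print(f"{key}: {entry.get('description', '')}")
```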
Run the commands below from the project root, or after installing the package into your venv.
- List configured sources:

  ```bash
  python main.py --list-sources
  ```

- Scrape a configured source and save JSON:

  ```bash
  python main.py --source nytimes --limit 20 --format json --output outputs/nytimes_posts.json
  ```

- Scrape a publication directly:

  ```bash
  python main.py --publication netflix --limit 5 --format table
  ```

- Bulk collection (group from YAML):

  ```bash
  python main.py --bulk tech_giants --limit 10 --format json
  ```

- Auto-discovery and skip session (production-ready):

  ```bash
  python main.py --publication pinterest --auto-discover --skip-session --format json --output results.json
  ```

- Custom post IDs (comma-separated). Each must be exactly 12 alphanumeric characters:

  ```bash
  python main.py --publication netflix --custom-ids "ac15cada49ef,64c786c2a3ac" --format json
  ```

Flags summary:

- `-p, --publication TEXT`
- `-s, --source TEXT`
- `-b, --bulk TEXT`
- `--list-sources`
- `-o, --output TEXT`
- `-f, --format [table|json|ids]`
- `--custom-ids TEXT` (comma-separated)
- `--skip-session`
- `--limit INTEGER`
- `--all` (collect all posts)
- `-m, --mode [ids|metadata|full|technical]` (presets that control enrichment and output)
- `--index` (create/update a simple inverted index, stored as JSON, for search)
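
If you need to run several scrapes in one go without defining a bulk collection, a small driver script can shell out to the CLI. The script below is not part of the repository; it is a sketch that only assumes the flags documented above.

```python
# Sketch of a batch driver that shells out to the CLI using documented flags.
# Not part of the repository; adjust source keys and limits to taste.
import subprocess

SOURCES = ["netflix", "nytimes", "airbnb"]  # keys from medium_sources.yaml

for key in SOURCES:
    cmd = [
        "python", "main.py",
        "--source", key,
        "--limit", "10",
        "--format", "json",
        "--output", f"outputs/{key}_posts.json",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=False)  # keep going even if one source fails
```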
You can add or update sources directly from the CLI using the `add-source` subcommand. This writes to `medium_sources.yaml` and is useful when you want to quickly register a publication without editing the YAML manually.
Example — add Pinterest:

```bash
python main.py add-source \
  --key pinterest \
  --type publication \
  --name pinterest \
  --description "Pinterest Engineering" \
  --auto-discover
```

Notes:

- `add-source` persists the change to `medium_sources.yaml` in the repo root.
- The command avoids loading optional network adapters, so it can run even if dependencies like `httpx` are not installed.
- To see the result, run `python main.py --list-sources`.
Interactive behavior and safety
- If the source key you pass already exists in `medium_sources.yaml`, the CLI asks for confirmation before overwriting. This prevents accidental data loss when updating an existing source. Example (interactive prompt shown):

  ```bash
  $ python main.py add-source --key pinterest --type publication --name pinterest --description "Pinterest Engineering"
  Source 'pinterest' already exists. Overwrite? [y/N]: y
  ✅ Source 'pinterest' added/updated in medium_sources.yaml
  ```

- To skip the interactive prompt and assume confirmation, use `--yes` (or `-y`):

  ```bash
  python main.py add-source --key pinterest --type publication --name pinterest --description "Pinterest Engineering" --yes
  ```

- The subcommand writes a normalized YAML entry (ensures booleans and required keys) and creates the `sources` block if it does not exist.

- After adding or updating a source you can:
  - run `python main.py --list-sources` to see the configured keys and descriptions; or
  - open `medium_sources.yaml` to inspect the persisted entry.
Implementation notes (for maintainers)
- The `add-source` subcommand avoids importing network adapters (e.g. `httpx`) when invoked, so it can be used on systems where optional runtime dependencies are not installed.
- The command is implemented in `src/presentation/cli.py` and uses `SourceConfigManager.add_or_update_source` (in `src/infrastructure/config/source_manager.py`) to persist changes; a rough sketch of that persist step follows below.
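
For orientation, here is a rough sketch of what such a persist step can look like. It is a hypothetical reconstruction, not the actual `SourceConfigManager.add_or_update_source` implementation; the real method's signature and validation rules live in `src/infrastructure/config/source_manager.py`.

```python
# Hypothetical sketch of an add/update step for medium_sources.yaml.
# The real implementation is SourceConfigManager.add_or_update_source
# in src/infrastructure/config/source_manager.py and may differ.
from pathlib import Path
import yaml

def add_or_update_source_sketch(key: str, name: str, source_type: str = "publication",
                                description: str = "", auto_discover: bool = True,
                                custom_domain: bool = False,
                                path: str = "medium_sources.yaml") -> None:
    config_path = Path(path)
    config = yaml.safe_load(config_path.read_text(encoding="utf-8")) if config_path.exists() else {}
    config = config or {}
    config.setdefault("sources", {})  # create the sources block if missing
    config["sources"][key] = {
        "type": source_type,
        "name": name,
        "description": description,
        "auto_discover": bool(auto_discover),   # normalize to real booleans
        "custom_domain": bool(custom_domain),
    }
    config_path.write_text(yaml.safe_dump(config, sort_keys=False, allow_unicode=True), encoding="utf-8")
```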
- The CLI bootstraps concrete adapters and repositories (e.g. `MediumApiAdapter`, `InMemoryPublicationRepository`, `MediumSessionRepository`).
- It creates the domain services `PostDiscoveryService` and `PublicationConfigService`.
- `ScrapePostsUseCase` orchestrates the flow: resolve config, initialize the session (unless skipped), handle custom IDs or auto-discovery, and collect posts.
- The use case returns a `ScrapePostsResponse` containing `Post` entities; the CLI formats and optionally saves the response. A self-contained stand-in sketch of this wiring appears below.
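
To make the layering concrete, here is a self-contained stand-in that mirrors the flow described above. The classes below are not the repository's classes (the real constructors and method names live in `src/presentation/cli.py` and `src/application/use_cases/scrape_posts.py`); they only illustrate the adapter → use case → response shape.

```python
# Self-contained stand-ins that mirror the flow described above.
# These are NOT the repository's classes; they only illustrate the shape
# of the clean-architecture wiring (adapter -> use case -> response).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Post:
    post_id: str
    title: str

@dataclass
class ScrapePostsResponse:
    posts: List[Post] = field(default_factory=list)

class FakeApiAdapter:
    """Stands in for an API adapter: returns canned post details."""
    def fetch_post(self, post_id: str) -> Post:
        return Post(post_id=post_id, title=f"Post {post_id}")

class ScrapePostsUseCaseSketch:
    """Mirrors the described orchestration: resolve IDs, fetch, collect."""
    def __init__(self, adapter: FakeApiAdapter):
        self.adapter = adapter

    def execute(self, post_ids: List[str]) -> ScrapePostsResponse:
        posts = [self.adapter.fetch_post(pid) for pid in post_ids]
        return ScrapePostsResponse(posts=posts)

if __name__ == "__main__":
    use_case = ScrapePostsUseCaseSketch(FakeApiAdapter())
    response = use_case.execute(["ac15cada49ef", "64c786c2a3ac"])
    for post in response.posts:
        print(post.post_id, post.title)
```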
Run all tests:

```bash
python -m pytest tests/ -v
```

Run only unit or integration tests:

```bash
python -m pytest tests/unit/ -v
python -m pytest tests/integration/ -v
```

Coverage reports are configured in `pyproject.toml` and generate `htmlcov/`.

Latest test run (local): TOTAL coverage 84%.

- HTML report: `htmlcov/index.html` (generated by pytest-cov)
- XML report: `coverage.xml`

Notes:

- Coverage is computed with pytest-cov. The HTML report lives in `htmlcov/` after running `pytest --cov=src --cov-report=html`.
- Some adapter branches remain partially covered; see `src/infrastructure/adapters/medium_api_adapter.py` for areas to target next.
- If you want a live badge that updates automatically, add CI with coverage upload (Codecov or Coveralls). See the "Add CI with GitHub Actions" task in the project TODO.
- Missing `medium_sources.yaml`: `SourceConfigManager` raises `FileNotFoundError` when `--source` or `--list-sources` is used.
- Custom IDs validation: `PostId` requires exactly 12 alphanumeric characters; invalid IDs raise a validation error (a quick stand-alone check is sketched below).
- Empty result / errors: the use case catches exceptions and returns an empty response, and the CLI prints troubleshooting tips. Use logging or run in a development environment to debug further.
- Output directory: defaults to `outputs/` (the CLI creates it if missing).
- `main.py` — entry point
- `src/presentation/cli.py` — CLI orchestration, formatting and progress UI
- `src/application/use_cases/scrape_posts.py` — main use case
- `src/domain/entities/publication.py` — domain entities
- `src/infrastructure/config/source_manager.py` — YAML loader
- `src/infrastructure/adapters/medium_api_adapter.py` — adapter for external API logic
- `src/infrastructure/content_extractor.py` — HTML → Markdown conversion, code extraction and a heuristics-based classifier
- `src/infrastructure/persistence.py` — persists Markdown, JSON metadata and assets
- `src/infrastructure/indexer.py` — simple inverted index writer used by persistence
The repository already includes CONTRIBUTING.md with development workflow, testing and coding standards. Please follow it when contributing.
MIT — see LICENSE for details.
This project now includes a technical extraction pipeline that converts full post HTML into:
- A Markdown rendering of the post content
- Extracted assets (images and other linked files), downloaded locally
- A JSON metadata file per post that includes extracted code blocks and a lightweight "technical" classifier
- An optional inverted index (`index.json`) for simple token-based search

These artifacts are created when the CLI runs with `--mode full` or `--mode technical`, and when the chosen output `--format` includes `md` or `--index` is passed.

Typical outputs (when using `--format md` and `--index`):

- `outputs/<source_key>/<post_id>.md` — Markdown content
- `outputs/<source_key>/<post_id>.json` — metadata (title, authors, date, code_blocks, classifier, original URL)
- `outputs/<source_key>/assets/<post_id>/<asset_filename>` — downloaded assets referenced from the post
- `outputs/index.json` — simple inverted index mapping tokens to posts (updated when `--index` is requested)
Example usage:

```bash
python main.py --source pinterest --mode technical --format md --output outputs/pinterest --index
```

Notes for maintainers and contributors

- The HTML→Markdown conversion lives in `src/infrastructure/content_extractor.py`. It uses BeautifulSoup and `markdownify` when available and falls back to conservative HTML-cleaning heuristics otherwise.
- Code block extraction and language detection are implemented with heuristics and a Pygments-based fallback; results appear under the `code_blocks` key in the per-post JSON metadata (see the sketch after this list).
- The lightweight classifier is intentionally heuristic for now (presence/volume of code blocks, presence of technical keywords). A machine-learning classifier can be plugged in later — the extractor and persistence functions accept optional hooks for that.
- Persistence (writing `.md`, `.json`, downloading assets and updating the index) is implemented in `src/infrastructure/persistence.py` and uses `src/infrastructure/indexer.py` to maintain `index.json`.
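
To illustrate the conversion and language-detection approach described above, here is a minimal, self-contained sketch using BeautifulSoup, `markdownify` and Pygments. It is not the code in `content_extractor.py`; it only shows the general technique, with a plain-text fallback when `markdownify` is unavailable.

```python
# Minimal sketch of HTML -> Markdown conversion plus code-block language guessing.
# This is illustrative only and is not the implementation in content_extractor.py.
from bs4 import BeautifulSoup

def html_to_markdown_sketch(html: str) -> str:
    try:
        from markdownify import markdownify
        return markdownify(html)
    except ImportError:
        # Conservative fallback: strip tags and keep the text content.
        return BeautifulSoup(html, "html.parser").get_text("\n")

def extract_code_blocks_sketch(html: str) -> list[dict]:
    """Collect <pre> blocks and guess their language with Pygments."""
    from pygments.lexers import guess_lexer
    from pygments.util import ClassNotFound

    blocks = []
    soup = BeautifulSoup(html, "html.parser")
    for pre in soup.find_all("pre"):
        code = pre.get_text()
        try:
            language = guess_lexer(code).name
        except ClassNotFound:
            language = "unknown"
        blocks.append({"language": language, "code": code})
    return blocks
```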
Security and safety
- Filenames for assets are sanitized before writing to disk. The persistence layer will not overwrite existing files unless explicitly allowed.
- The index is intentionally simple and file-based; for larger collections consider migrating to a proper search engine (Elasticsearch, Meilisearch, SQLite FTS, etc.).
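
As a concrete picture of what "simple and file-based" means here, the sketch below builds a token → post-ID mapping of the kind `index.json` could hold. It is an illustration of the idea, not the code in `src/infrastructure/indexer.py`.

```python
# Illustration of a simple file-based inverted index (token -> post IDs).
# Not the implementation in src/infrastructure/indexer.py.
import json
import re
from collections import defaultdict
from pathlib import Path

def update_index_sketch(index_path: str, post_id: str, text: str) -> None:
    path = Path(index_path)
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    inverted = defaultdict(set, {token: set(ids) for token, ids in existing.items()})
    for token in set(re.findall(r"[a-z0-9]+", text.lower())):
        inverted[token].add(post_id)
    path.write_text(
        json.dumps({token: sorted(ids) for token, ids in inverted.items()}, indent=2),
        encoding="utf-8",
    )

def search_sketch(index_path: str, query: str) -> list[str]:
    index = json.loads(Path(index_path).read_text(encoding="utf-8"))
    return sorted(index.get(query.lower(), []))
```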
Where to start when improving extraction
- Improve Markdown fidelity: tweak `content_extractor.html_to_markdown()` and add post-processing steps for link rewriting and image `src` normalization.
- Harden asset handling: dedupe asset downloads, support remote storage backends, and canonicalize filenames.
- Replace the heuristic classifier with an ML model: the extractor already records `code_blocks` and features to make training easier (a baseline heuristic is sketched below).
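
For contributors who want to start with the classifier, the snippet below shows the kind of heuristic described above (code-block volume plus technical keywords). It is a hypothetical baseline, not the classifier shipped in `content_extractor.py`.

```python
# Hypothetical baseline for a "technical post" heuristic, not the shipped classifier.
TECH_KEYWORDS = {"api", "latency", "kubernetes", "database", "deployment", "algorithm"}

def classify_technical_sketch(markdown_text: str, code_blocks: list[dict]) -> dict:
    words = markdown_text.lower().split()
    keyword_hits = sum(1 for w in words if w.strip(".,()") in TECH_KEYWORDS)
    # Weight code blocks heavily, keywords lightly; cap the score at 1.0.
    score = min(1.0, 0.2 * len(code_blocks) + 0.02 * keyword_hits)
    return {"technical": score >= 0.5, "score": round(score, 2)}

print(classify_technical_sketch("Reducing deployment latency on Kubernetes",
                                [{"language": "yaml", "code": "kind: Pod"}] * 3))
```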