Automated system that extracts Brazilian news content from RSS feeds, scores it with an intelligent pre-filter, and submits qualifying items to AletheiaFact for fact-checking verification via OAuth2.
- Python 3.11+, FastAPI (async)
- MongoDB with Motor (async driver)
- APScheduler for periodic extraction
- Docker & Docker Compose
- OAuth2 Client Credentials (Ory Hydra)
cp .env.example .env
# Edit .env with your OAuth2 credentials
docker-compose up -d
docker-compose exec api python scripts/seed_sources.py
Obtain M2M (machine-to-machine) OAuth2 credentials from the AletheiaFact team and update .env with the provided client ID and secret.
Dashboard: http://localhost:8000
The system uses a Factory Pattern for flexible content extraction:
- ExtractorFactory: Routes to the appropriate extractor based on sourceType
- RSSExtractor: Parses RSS/Atom feeds via feedparser (15 sources)
- HTMLExtractor: Scrapes static HTML pages via BeautifulSoup (1 source)
- Future: API extractors for JSON/REST sources
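A minimal sketch of how this routing might look (the extractor class names match the list above, but the abstract base class and registry layout are assumptions, not the project's actual API):

```python
# Illustrative factory sketch; the base class and registry details are assumptions.
from abc import ABC, abstractmethod

import feedparser


class BaseExtractor(ABC):
    @abstractmethod
    async def extract(self, source: dict) -> list[dict]:
        """Return raw article dicts for the given source."""


class RSSExtractor(BaseExtractor):
    async def extract(self, source: dict) -> list[dict]:
        feed = feedparser.parse(source["url"])  # feedparser itself is synchronous
        return [{"title": e.title, "sourceUrl": e.link} for e in feed.entries]


class HTMLExtractor(BaseExtractor):
    async def extract(self, source: dict) -> list[dict]:
        raise NotImplementedError("fetch the page, then parse it with BeautifulSoup")


class ExtractorFactory:
    _registry: dict[str, type[BaseExtractor]] = {"rss": RSSExtractor, "html": HTMLExtractor}

    @classmethod
    def create(cls, source_type: str) -> BaseExtractor:
        try:
            return cls._registry[source_type]()
        except KeyError:
            raise ValueError(f"Unsupported sourceType: {source_type}")
```

Adding a future JSON/REST extractor would then only require registering one more class in the registry.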
RSS Feeds ──┐
            ├→ ExtractorFactory → [RSSExtractor or HTMLExtractor] → PreFilter (scoring) → MongoDB
HTML Pages ─┘                                                                                 ↓
                                                                     (score ≥ 38 & status=pending)
                                                                                              ↓
                                                        SubmissionService → OAuth2 → AletheiaFact API
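One pass of this pipeline could look roughly like the following. This is a sketch only: it assumes the factory sketched above, the prefilter_score function sketched in the scoring section below, a Motor db handle, and the thresholds from the configuration section; function and collection names are illustrative.

```python
# Hypothetical extraction cycle; names are illustrative, not the real module API.
MINIMUM_SAVE_SCORE = 20
SUBMISSION_SCORE_THRESHOLD = 38


async def run_extraction_cycle(db, sources: list[dict]) -> None:
    for source in sources:
        extractor = ExtractorFactory.create(source["sourceType"])
        for article in await extractor.extract(source):
            score = prefilter_score(article, source["credibility"])
            if score < MINIMUM_SAVE_SCORE:
                continue  # too low to be worth storing at all
            article["score"] = score
            # Only items at or above the submission threshold wait for submission.
            article["status"] = "pending" if score >= SUBMISSION_SCORE_THRESHOLD else "rejected"
            await db.content.insert_one(article)  # Motor async insert
```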
Tiered Base Scoring (pick highest, not additive)
- Government entities: 12 pts
- Political keywords: 10 pts
- Domain keywords: 8 pts
Verifiable Data (10 pts each)
- Percentages, currency, numbers with context
Checkability Signals
- Direct quotes: +8 pts
- Attributions: +6 pts
- Named entities: +4 pts
Source Risk Priority
- Low credibility: 10 pts (highest priority: misinformation monitoring)
- Medium credibility: 5 pts
- High credibility: 3 pts
Context-Aware Penalties
- Speculation: -15 pts
- Conditional statements: -12 pts
- Vague language: -8 pts
Bonuses
- Official guidance: +6 pts
- Health/safety advisories: +8 pts
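Put together, the scorer might look like this. A sketch only: the keyword lists and regex patterns are placeholders, and only one or two examples per signal class are shown.

```python
# Illustrative pre-filter scorer; keyword lists and patterns are placeholders.
import re

GOVERNMENT_ENTITIES = ["ministério", "governo federal", "stf"]
POLITICAL_KEYWORDS = ["eleição", "congresso", "presidente"]
DOMAIN_KEYWORDS = ["economia", "saúde", "educação"]
SPECULATION_MARKERS = ["pode ser que", "especula-se", "supostamente"]


def prefilter_score(article: dict, credibility: str) -> int:
    text = f"{article.get('title', '')} {article.get('content', '')}".lower()

    # Tiered base: the highest matching tier wins; tiers are not summed.
    tiers = [(12, GOVERNMENT_ENTITIES), (10, POLITICAL_KEYWORDS), (8, DOMAIN_KEYWORDS)]
    score = max((pts for pts, kws in tiers if any(k in text for k in kws)), default=0)

    # Verifiable data: +10 for each kind of concrete figure present.
    for pattern in (r"\d+(?:[.,]\d+)?\s*%", r"r\$\s*\d", r"\b\d{4,}\b"):
        if re.search(pattern, text):
            score += 10

    # Checkability signals (+8 quotes, +6 attributions; named entities omitted here).
    if '"' in text or "“" in text:
        score += 8
    if re.search(r"\b(disse|afirmou|declarou)\b", text):
        score += 6

    # Source risk: low-credibility outlets get priority for monitoring.
    score += {"low": 10, "medium": 5, "high": 3}.get(credibility, 0)

    # Context-aware penalties, e.g. speculation (-15).
    if any(k in text for k in SPECULATION_MARKERS):
        score -= 15

    return score
```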
High Credibility (4 RSS): G1, Folha de S.Paulo, BBC Brasil, Estado de S.Paulo
Medium Credibility (6 RSS): CNN Brasil, Poder360, CartaCapital, Gazeta do Povo, Metrópoles, The Intercept Brasil
Low Credibility (5 RSS + 1 HTML): Terça Livre, Jornal da Cidade Online, Brasil 247, Conexão Política, DCM, Brasil Paralelo
Required environment variables:
# AletheiaFact API
ALETHEIA_BASE_URL=http://localhost:3000
# Ory Hydra OAuth2
ORY_CLIENT_ID=your_client_id
ORY_CLIENT_SECRET=your_secret
ORY_SCOPE=openid offline_access
# Optional
RECAPTCHA_TOKEN=
EXTRACTION_INTERVAL_MINUTES=30
MINIMUM_SAVE_SCORE=20
SUBMISSION_SCORE_THRESHOLD=38
AUTO_SUBMIT_ENABLED=false
MAX_BATCH_SUBMISSION=100
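A straightforward way to load these values (a sketch using os.getenv; the actual project may centralize this differently, e.g. with pydantic settings):

```python
# Minimal settings loader for the variables above (illustrative).
import os

ALETHEIA_BASE_URL = os.getenv("ALETHEIA_BASE_URL", "http://localhost:3000")
ORY_CLIENT_ID = os.getenv("ORY_CLIENT_ID", "")
ORY_CLIENT_SECRET = os.getenv("ORY_CLIENT_SECRET", "")
ORY_SCOPE = os.getenv("ORY_SCOPE", "openid offline_access")
RECAPTCHA_TOKEN = os.getenv("RECAPTCHA_TOKEN", "")
EXTRACTION_INTERVAL_MINUTES = int(os.getenv("EXTRACTION_INTERVAL_MINUTES", "30"))
MINIMUM_SAVE_SCORE = int(os.getenv("MINIMUM_SAVE_SCORE", "20"))
SUBMISSION_SCORE_THRESHOLD = int(os.getenv("SUBMISSION_SCORE_THRESHOLD", "38"))
# Booleans need explicit parsing; bool("false") would be truthy.
AUTO_SUBMIT_ENABLED = os.getenv("AUTO_SUBMIT_ENABLED", "false").lower() == "true"
MAX_BATCH_SUBMISSION = int(os.getenv("MAX_BATCH_SUBMISSION", "100"))
```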
Sources
- GET /api/sources
- POST /api/sources
- PUT /api/sources/{id}
- DELETE /api/sources/{id}
- POST /api/sources/{id}/extract
Content
- GET /api/content
- GET /api/content/{id}
- POST /api/content/{id}/submit
- DELETE /api/content/{id}
Integration
- POST /api/aletheia/submit-pending
Trigger extraction:
curl -X POST http://localhost:8000/api/sources/{source_id}/extract
Submit pending:
curl -X POST "http://localhost:8000/api/aletheia/submit-pending?limit=100"
View logs:
docker-compose logs -f api
MongoDB shell:
docker-compose exec mongodb mongosh monitoring_poc
Submitted to AletheiaFact as:
{
"content": "Article text...",
"receptionChannel": "automated_monitoring",
"reportType": "claim",
"impactArea": {"label": "Politics", "value": "politics"},
"source": [{"href": "https://source-url.com"}],
"publicationDate": "2025-01-15T10:30:00",
"date": "2025-01-15T11:00:00",
"heardFrom": "Automated Monitoring - Source Name",
"recaptcha": "optional_token"
}
Impact areas detected via keywords: Politics, Health, Science, General.
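End to end, the SubmissionService flow might look like this. A sketch with httpx: the Hydra token URL and the AletheiaFact route are assumptions, and the ORY_*/ALETHEIA_* constants come from the settings loader sketched in the configuration section.

```python
# Hypothetical submission flow: client-credentials token, then POST the payload above.
from datetime import datetime, timezone

import httpx

ORY_TOKEN_URL = "http://localhost:4444/oauth2/token"  # assumed Hydra public endpoint


async def submit_content(article: dict) -> None:
    async with httpx.AsyncClient() as client:
        # 1. Exchange client credentials for an access token.
        token_resp = await client.post(
            ORY_TOKEN_URL,
            data={"grant_type": "client_credentials", "scope": ORY_SCOPE},
            auth=(ORY_CLIENT_ID, ORY_CLIENT_SECRET),
        )
        token_resp.raise_for_status()
        access_token = token_resp.json()["access_token"]

        # 2. Submit the payload shown above. "/api/claim" is an assumed route.
        resp = await client.post(
            f"{ALETHEIA_BASE_URL}/api/claim",
            json={
                "content": article["content"],
                "receptionChannel": "automated_monitoring",
                "reportType": "claim",
                "impactArea": article["impactArea"],
                "source": [{"href": article["sourceUrl"]}],
                "publicationDate": article["publicationDate"],
                "date": datetime.now(timezone.utc).isoformat(),
                "heardFrom": f"Automated Monitoring - {article['sourceName']}",
            },
            headers={"Authorization": f"Bearer {access_token}"},
        )
        resp.raise_for_status()
```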
Two-layer efficient deduplication:
- Early URL Check (90% processing reduction)
  - Normalize URL (remove tracking params, upgrade http→https)
  - Check indexed sourceUrl before NLP processing
  - Skip duplicate entries immediately
- Content Hash Fallback
  - SHA-256 hash of url + normalized_content
  - Catches the same article on different URLs
Performance: RSS feeds are still fetched in full, but duplicate entries skip claim extraction, scoring, and language detection.
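A sketch of both layers. The tracking-parameter list is a placeholder, and note one deliberate variant: the fallback here hashes the normalized text on its own, which is what lets the same article collide across different URLs.

```python
# Illustrative two-layer dedup; field and collection names follow the description above.
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    scheme = "https" if parts.scheme == "http" else parts.scheme
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS])
    return urlunsplit((scheme, parts.netloc, parts.path, query, ""))


def content_hash(content: str) -> str:
    normalized = re.sub(r"\s+", " ", content).strip().lower()
    return hashlib.sha256(normalized.encode()).hexdigest()


async def is_duplicate(db, url: str, content: str) -> bool:
    # Layer 1: cheap indexed lookup on the normalized URL, before any NLP work.
    if await db.content.find_one({"sourceUrl": normalize_url(url)}, {"_id": 1}):
        return True
    # Layer 2: content-hash fallback for the same article under a different URL.
    return await db.content.find_one({"contentHash": content_hash(content)}, {"_id": 1}) is not None
```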
- pending: Awaiting submission (score ≥ 38)
- submitted: Successfully sent to AletheiaFact
- rejected: Below score threshold
- failed: Submission error (retryable)
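As an enum (illustrative wrapper; the stored values are the plain strings above):

```python
# Lifecycle states as stored on each content document.
from enum import Enum


class ContentStatus(str, Enum):
    PENDING = "pending"      # score >= 38, awaiting submission
    SUBMITTED = "submitted"  # successfully sent to AletheiaFact
    REJECTED = "rejected"    # below the score threshold
    FAILED = "failed"        # submission error; eligible for retry
```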
No extractions: Check scheduler logs, verify sources are active, manually trigger extraction
Submission failures: Review OAuth2 config, inspect failed content errors in dashboard
Low submission rate: Review pre-filter scores in /api/stats, verify credibility levels
Duplicates: Automatically handled via URL normalization + indexed checks