Update data 2025 men 2024 women #16
- Reorganize repository structure with src/ layout
- Rename Downloader class to Scraper for better clarity
- Rename Plotter class to Visualizer for better clarity
- Fix websockets compatibility issue (pin to ~8.1)
- Add support for 2025 men's Tour de France data
- Add support for 2024 women's Tour de France data
- Implement domain-aware URL handling for men's vs women's sites
- Fix decimal distance parsing for accurate data extraction
- Remove obsolete Quarto website files
- Add comprehensive download script for both men's and women's data
- Implement GitHub Actions workflow to prevent data deletion
- Check that CSV files only grow (no row/column deletions)
- Verify data integrity (no modifications to existing data)
- Allow only additions of new rows and columns
- Provide detailed error messages and fix suggestions
- Include comprehensive documentation and examples

Safety measures:
- ✅ Row count protection (only increases allowed)
- ✅ Column count protection (only increases allowed)
- ✅ Data integrity verification (no modifications)
- ✅ Automatic PR blocking on violations
- ✅ Clear error reporting with fix suggestions
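The "only grow" rule above can be sketched in a few lines of pandas. This is a minimal illustration, not the workflow's actual code: `check_only_grows` is a hypothetical helper, and it assumes existing rows keep their position so that new rows are appended at the end.

```python
import pandas as pd


def check_only_grows(old: pd.DataFrame, new: pd.DataFrame) -> list[str]:
    """Return violations; an empty list means the change only adds data."""
    errors = []
    if len(new) < len(old):
        errors.append(f"row count decreased: {len(old)} -> {len(new)}")
    missing = [c for c in old.columns if c not in new.columns]
    if missing:
        errors.append(f"columns removed: {missing}")
    if errors:
        return errors
    # Compare the old table with the matching slice of the new one as strings.
    overlap = new.iloc[: len(old)][list(old.columns)].reset_index(drop=True)
    if not overlap.astype(str).equals(old.reset_index(drop=True).astype(str)):
        errors.append("existing data modified")
    return errors


# Made-up sample tables, for illustration only.
old = pd.DataFrame({"Year": [2023, 2024], "Rank": [1, 1]})
grown = pd.DataFrame(
    {"Year": [2023, 2024, 2025], "Rank": [1, 1, 1], "Team": ["a", "b", "c"]}
)
edited = pd.DataFrame({"Year": [2023, 2024], "Rank": [1, 2]})

print(check_only_grows(old, grown))   # additive change
print(check_only_grows(old, edited))  # an existing value was changed
```

When comparing real files, reading both versions with `pd.read_csv(path, dtype=str)` would likely avoid false alarms caused by a column changing dtype (for example, integers becoming floats once a new row contains a missing value).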
🗂️ Data Organization:
- Move CSV files to data/men/ and data/women/ subfolders
- Move plots to data/plots/ folder
- Update all file paths and references

🛠️ Makefile Integration:
- Add comprehensive Makefile with common commands
- make update: Download data only (no plots)
- make plot: Generate plots from existing data
- make all: Run both update and plot
- make install/test/lint/format: Development workflow
- make check-csv: Validate data integrity
- make migrate-data: One-time data migration helper

📊 Separate Plot Generation:
- Create dedicated scripts/generate_plots.py
- Decouple data download from plot generation
- Support independent plot regeneration

📚 Updated Documentation:
- Comprehensive README with new structure
- Updated file paths and examples
- Added development and contribution guidelines
- Document 2025 men's and 2024 women's data coverage

🛡️ Enhanced Data Protection:
- Update CSV protection to handle subdirectories
- Support data/men/*.csv and data/women/*.csv patterns
- Maintain integrity checking across folder structure

🔧 Development Improvements:
- Update poetry.lock with latest dependencies
- Clean separation of concerns (download vs plot)
- Better developer experience with make commands
- Update documentation and scripts based on manual refinements
- Ensure all file paths and references are correctly updated
- Final polish for repository organization and tooling
🔧 Root Cause Fixed:
- The _get_urls method was reversing the list of matches ([::-1])
- This caused the scraper to process oldest data first instead of newest
- Latest years (2025 men's, 2024 women's) were available but processed last

✅ Solution Applied:
- Removed the list reversal in _get_urls method
- Now processes most recent years first
- 2025 men's data: 3323km, 21 stages, 160 riders
- 2024 women's data: 946km, 8 stages, 110 riders

🧪 Validation:
- Verified both datasets are now retrieved correctly
- Men's 2025 Tour de France (Tadej Pogačar winner) ✅
- Women's 2024 Tour de France (Katarzyna Niewiadoma-Phinney winner) ✅
- All data integrity maintained for historical years
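The effect of the `[::-1]` reversal can be seen with a toy list. The sketch below is an assumption-laden illustration: the link strings are placeholders, and `get_urls_old` / `get_urls_new` are hypothetical stand-ins for the behaviour of `_get_urls` before and after the fix, not its real code.

```python
# Placeholder links in page order, assumed to be newest first.
matches = ["year-2025", "year-2024", "year-2023"]


def get_urls_old(found):
    # Previous behaviour: reversing put the oldest year first.
    return found[::-1]


def get_urls_new(found):
    # Fixed behaviour: keep page order, so the newest year is processed first.
    return list(found)


print(get_urls_old(matches)[0])
print(get_urls_new(matches)[0])
```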
- Add 2025 Tour de France men's data (Tadej Pogačar winner)
- Add 2024 Tour de France Femmes women's data (Katarzyna Niewiadoma winner)
- Update all historical data files with latest information
- Add debug scripts for data validation and testing
- Fix websockets compatibility issue (pinned to ~8.1)
- Remove list reversal in URL processing for proper chronological order
- Create comprehensive debugging tools for future maintenance

Data coverage:
- Men's: 1903-2025 (122 years complete)
- Women's: 2022-2024 (3 years since restart complete)
- Move test_recent_download.py from scripts/ to tests/
- Move test_recent_links.py from scripts/ to tests/
- Move debug_latest_data.py from scripts/ to tests/
- Add tests/__init__.py to make it a proper Python package
- Update Makefile to include tests/ in linting and formatting commands

This better separates testing/debugging utilities from main scripts, improving project organization and following Python best practices.
- Replace pytest command with direct execution of test files
- Run test_recent_links.py, test_recent_download.py, and debug_latest_data.py
- All tests now work properly from the tests/ directory
✨ Features:
- Add DataPostProcessor class in src/letourdataset/postprocessor.py
- Add postprocess_data.py script for running postprocessor
- Integrate postprocessing into Makefile pipeline (runs after data download)

📊 Data Organization:
- TDFF_All_Rankings_History.csv: Sort by Year, Stages, Rank (all numeric, first 3 columns)
- TDFF_Riders_History.csv: Sort by Year, Rank (both numeric, first 2 columns)
- TDFF_Stages_History.csv: Sort by Year, Rank (both numeric, first 2 columns)
- TDF_* files: Same logic applied to men's data

🔧 Pipeline Integration:
- make update: Now includes automatic postprocessing after download
- make postprocess: Standalone command for manual postprocessing
- All data files automatically sorted and organized before any commit
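The sorting rule above (order rows numerically by the first few columns) could look roughly like this in pandas. It is a sketch only: `sort_by_leading_columns` is a hypothetical helper, not the `DataPostProcessor` API, and the sample rows are made up.

```python
import pandas as pd


def sort_by_leading_columns(df: pd.DataFrame, n_keys: int) -> pd.DataFrame:
    """Sort rows by the first n_keys columns, compared as numbers."""
    df = df.reset_index(drop=True)
    keys = list(df.columns[:n_keys])
    # Convert the key columns to numbers so that "9" sorts before "10".
    numeric = df[keys].apply(pd.to_numeric, errors="coerce")
    order = numeric.sort_values(keys).index
    return df.loc[order].reset_index(drop=True)


# Made-up rows; values are strings, as they would be when read with dtype=str.
riders = pd.DataFrame(
    {"Year": ["2024", "2022", "2024"], "Rank": ["10", "2", "9"], "Rider": ["a", "b", "c"]}
)
print(sort_by_leading_columns(riders, n_keys=2))
```

Sorting on converted numeric keys rather than on the raw strings matters for ranks: a plain string sort would place "10" ahead of "9".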
- Update pyproject.toml to properly specify src/ package location
- Remove manual sys.path manipulation from all scripts and tests
- Add VS Code settings for better IDE support with src/ structure
- Add IDE folders to .gitignore for cleaner repo
- All scripts now use clean imports via poetry install

This resolves the red squiggly import errors in IDEs while maintaining full functionality of all scripts and tests.
- Remove TDF_All_Rankings_History.csv and TDFF_All_Rankings_History.csv from git
- Add *All_Rankings_History.csv to .gitignore to prevent future tracking
- Files remain locally available but won't be committed to repository
- Postprocessor continues to work with these files when present

These large files (50MB+) are better kept local to avoid repository bloat.
- Update pyproject.toml with proper src/ package configuration
- Clean up all import statements across scripts and tests
- Remove manual sys.path manipulation in favor of proper Poetry imports
- Add comprehensive .gitignore patterns for macOS and system files
- Process and sort all data files with postprocessor pipeline
- Ensure all scripts work with clean, professional import structure

This finalizes the transition to a modern Python package structure with proper dependency management and clean imports throughout.
- Remove direct numpy dependency from pyproject.toml
- Replace all numpy usage with pandas equivalents:
  * np.array() → .values
  * np.isnan() → pd.isna()
  * np.unique() → pd.Series.unique()
- Update pandas to v2.3.1 for better compatibility
- Add lxml-html-clean for requests-html compatibility
- Move jupyter/ipykernel to dev dependencies section
- All functionality tested and working correctly
- Repository now has cleaner, more focused dependencies
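The three replacements listed above can be tried on a small Series. This is an illustrative sketch with made-up values, not code from the repository. Note that `.values` still returns a NumPy array, since pandas itself depends on NumPy; only the direct import goes away.

```python
import pandas as pd

# Made-up values with one missing entry.
s = pd.Series([3.0, None, 1.0, 3.0])

values = s.values                # underlying array, instead of np.array(s)
missing = pd.isna(s)             # element-wise missing check, instead of np.isnan(s)
distinct = s.dropna().unique()   # distinct values in order of appearance

print(len(values), list(missing), list(distinct))
```

One behavioural difference worth keeping in mind: `np.unique()` returns sorted values, while `Series.unique()` preserves order of appearance, so callers that relied on sorted output would need an explicit sort.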
Pull Request Overview
This pull request implements a comprehensive restructuring and modernization of the LeTourDataSet project to include 2025 men's and 2024 women's Tour de France data. The changes transform the project from a simple script into a well-organized Python package with automated data processing capabilities, modern tooling, and robust data protection workflows.
Key changes include:
- Complete project restructuring with modern Python packaging and organized data folders
- Addition of latest 2025 men's and 2024 women's Tour de France data with comprehensive CSV files
- Implementation of automated data processing pipeline with Makefile commands and post-processing capabilities
Reviewed Changes
Copilot reviewed 28 out of 40 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/letourdataset/scraper.py | Complete rewrite of data scraping functionality with improved error handling and women's tour support |
| src/letourdataset/visualizer.py | Updated visualization component with dynamic year ranges and improved plotting capabilities |
| data/women/TDFF_*.csv | New comprehensive women's Tour de France datasets for 2022-2024 |
| scripts/*.py | New automated scripts for data downloading, processing, and visualization generation |
| Makefile | Comprehensive automation for data management, testing, and development workflows |
| pyproject.toml | Modern Python packaging configuration with updated dependencies |
| README.md | Complete documentation rewrite with usage examples and development guidelines |

      def plot(self, df: pd.DataFrame, saveas="data/TDF_Distance_And_Pace.png") -> None:
    -     last_year = 2024
    +     last_year = 2030

The hardcoded year 2030 should be made dynamic or configurable to avoid requiring manual updates as years progress.
Suggested change:

    - last_year = 2030
    + last_year = df["Year"].max()

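A small sketch of the suggested approach, assuming the DataFrame passed to `plot` has a numeric `Year` column as the suggestion implies. The frame below holds illustrative years only.

```python
import pandas as pd

# Illustrative frame; only the "Year" column matters here.
df = pd.DataFrame({"Year": [1903, 1904, 2024, 2025]})

# Derive the upper bound from the data instead of hardcoding it.
# int() converts the NumPy integer returned by max() to a plain Python int.
last_year = int(df["Year"].max())
print(last_year)
```

With this, adding a new edition to the data moves the plotted range automatically, with no source change.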

    async def main():
        """Download historical Tour de France data for both men's and women's races."""
        base_folder = "../data"

Using relative paths like '../data' can be fragile and fail depending on the working directory. Consider using absolute paths or Path.resolve() for more robust file handling.
Suggested change:

    - base_folder = "../data"
    + base_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "data"))

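The `Path.resolve()` alternative mentioned in the comment could be sketched as below. `data_folder` is a hypothetical helper, and it assumes the script sits one directory below the repository root, next to a `data/` folder, which matches the `"../data"` path in the diff.

```python
from pathlib import Path


def data_folder(script_file: str) -> Path:
    """Return <repo>/data for a script stored in <repo>/<some folder>/."""
    # resolve() makes the path absolute, so the result does not depend
    # on the directory the script is launched from.
    return Path(script_file).resolve().parent.parent / "data"


# Inside a script this would typically be called as data_folder(__file__).
print(data_folder("scripts/download.py"))
```

Both this and the `os.path` version in the suggestion anchor the path to the script's location instead of the current working directory.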

    df["GapSeconds"] = 0
    ts = tmp["TotalSeconds"].values
    gs = tmp["GapSeconds"].values
    ts[1:] = ts[0] + gs[1:]

The calculation for years 2006 and 1997 uses NumPy-style array operations on pandas Series. This hardcoded logic for specific years should be documented with the reason for this special handling.
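For readers unfamiliar with the idiom, the snippet appears to rebuild each rider's total time as the leader's time plus that rider's gap. The sketch below works under that assumption; the source does not state why 2006 and 1997 need it, and the numbers are invented.

```python
import pandas as pd

# Invented standings: only the leader's total time is known.
tmp = pd.DataFrame(
    {"Rank": [1, 2, 3], "TotalSeconds": [300000, 0, 0], "GapSeconds": [0, 75, 190]}
)

# Work on an explicit copy and write the result back to the column.
ts = tmp["TotalSeconds"].to_numpy(copy=True)
gs = tmp["GapSeconds"].to_numpy()

# Every rider after the leader: leader's total time + own gap to the leader.
ts[1:] = ts[0] + gs[1:]
tmp["TotalSeconds"] = ts

print(tmp)
```

Taking a copy and assigning it back avoids relying on in-place mutation of the array returned by `.values`, which can be read-only when pandas copy-on-write is enabled.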
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
No description provided.