Update data 2025 men 2024 women by thomascamminady · Pull Request #16 · thomascamminady/LeTourDataSet

thomascamminady · 2025-08-02T13:45:45Z

No description provided.

- Reorganize repository structure with src/ layout
- Rename Downloader class to Scraper for better clarity
- Rename Plotter class to Visualizer for better clarity
- Fix websockets compatibility issue (pin to ~8.1)
- Add support for 2025 men's Tour de France data
- Add support for 2024 women's Tour de France data
- Implement domain-aware URL handling for men's vs women's sites
- Fix decimal distance parsing for accurate data extraction
- Remove obsolete Quarto website files
- Add comprehensive download script for both men's and women's data

- Implement GitHub Actions workflow to prevent data deletion
- Check that CSV files only grow (no row/column deletions)
- Verify data integrity (no modifications to existing data)
- Allow only additions of new rows and columns
- Provide detailed error messages and fix suggestions
- Include comprehensive documentation and examples

Safety measures:
- ✅ Row count protection (only increases allowed)
- ✅ Column count protection (only increases allowed)
- ✅ Data integrity verification (no modifications)
- ✅ Automatic PR blocking on violations
- ✅ Clear error reporting with fix suggestions
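The "only grow" rule in this commit could be checked roughly as follows. This is a minimal sketch, not the workflow's actual script (which this page does not show): the function name `check_csv_only_grows` is hypothetical, and it assumes new rows are appended at the bottom and new columns on the right, so existing cells keep their positions.

```python
import csv
import io


def check_csv_only_grows(old_text: str, new_text: str) -> list[str]:
    """Return a list of violations; an empty list means the change is purely additive."""
    old_rows = list(csv.reader(io.StringIO(old_text)))
    new_rows = list(csv.reader(io.StringIO(new_text)))
    problems = []

    # Row count protection: the new file may not have fewer rows.
    if len(new_rows) < len(old_rows):
        problems.append(f"row count shrank: {len(old_rows)} -> {len(new_rows)}")

    # Column count protection: compare header widths.
    old_width = len(old_rows[0]) if old_rows else 0
    new_width = len(new_rows[0]) if new_rows else 0
    if new_width < old_width:
        problems.append(f"column count shrank: {old_width} -> {new_width}")

    # Integrity: every old cell must be unchanged at the same position
    # (assumption: additions go at the end of the file or of each row).
    for i, old_row in enumerate(old_rows[: len(new_rows)]):
        if new_rows[i][: len(old_row)] != old_row:
            problems.append(f"row {i} was modified")

    return problems
```

A CI job could call this on the base-branch and PR versions of each CSV and fail when the returned list is non-empty. A position-based comparison like this would flag a pure re-sort of existing rows as a modification, so a real check may need to compare rows by key instead.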

🗂️ Data Organization:
- Move CSV files to data/men/ and data/women/ subfolders
- Move plots to data/plots/ folder
- Update all file paths and references

🛠️ Makefile Integration:
- Add comprehensive Makefile with common commands
- make update: Download data only (no plots)
- make plot: Generate plots from existing data
- make all: Run both update and plot
- make install/test/lint/format: Development workflow
- make check-csv: Validate data integrity
- make migrate-data: One-time data migration helper

📊 Separate Plot Generation:
- Create dedicated scripts/generate_plots.py
- Decouple data download from plot generation
- Support independent plot regeneration

📚 Updated Documentation:
- Comprehensive README with new structure
- Updated file paths and examples
- Added development and contribution guidelines
- Document 2025 men's and 2024 women's data coverage

🛡️ Enhanced Data Protection:
- Update CSV protection to handle subdirectories
- Support data/men/*.csv and data/women/*.csv patterns
- Maintain integrity checking across folder structure

🔧 Development Improvements:
- Update poetry.lock with latest dependencies
- Clean separation of concerns (download vs plot)
- Better developer experience with make commands

- Update documentation and scripts based on manual refinements
- Ensure all file paths and references are correctly updated
- Final polish for repository organization and tooling

🔧 Root Cause Fixed:
- The _get_urls method was reversing the list of matches ([::-1])
- This caused the scraper to process oldest data first instead of newest
- Latest years (2025 men's, 2024 women's) were available but processed last

✅ Solution Applied:
- Removed the list reversal in _get_urls method
- Now processes most recent years first
- 2025 men's data: 3323km, 21 stages, 160 riders
- 2024 women's data: 946km, 8 stages, 110 riders

🧪 Validation:
- Verified both datasets are now retrieved correctly
- Men's 2025 Tour de France (Tadej Pogačar winner) ✅
- Women's 2024 Tour de France (Katarzyna Niewiadoma-Phinney winner) ✅
- All data integrity maintained for historical years
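The effect of the `[::-1]` reversal described in this commit can be illustrated with a small sketch. The HTML snippet, the URL pattern, and the standalone `get_urls` function are hypothetical stand-ins; the real `_get_urls` method is not shown on this page. The only assumption carried over from the commit message is that the site lists the newest edition first.

```python
import re

# Hypothetical markup: the page lists editions newest-first.
SAMPLE_HTML = """
<option value="/en/history/2025">2025</option>
<option value="/en/history/2024">2024</option>
<option value="/en/history/1903">1903</option>
"""


def get_urls(html: str, reverse: bool = False) -> list[str]:
    """Extract history URLs in page order; reverse=True mimics the old behavior."""
    matches = re.findall(r'value="([^"]+)"', html)
    return matches[::-1] if reverse else matches


old_order = get_urls(SAMPLE_HTML, reverse=True)  # oldest edition comes first
new_order = get_urls(SAMPLE_HTML)                # newest edition comes first
```

With the reversal removed, the first URL handed to the scraper is the most recent edition, which matches the "now processes most recent years first" note above.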

- Add 2025 Tour de France men's data (Tadej Pogačar winner)
- Add 2024 Tour de France Femmes women's data (Katarzyna Niewiadoma winner)
- Update all historical data files with latest information
- Add debug scripts for data validation and testing
- Fix websockets compatibility issue (pinned to ~8.1)
- Remove list reversal in URL processing for proper chronological order
- Create comprehensive debugging tools for future maintenance

Data coverage:
- Men's: 1903-2025 (122 years complete)
- Women's: 2022-2024 (3 years since restart complete)

- Move test_recent_download.py from scripts/ to tests/
- Move test_recent_links.py from scripts/ to tests/
- Move debug_latest_data.py from scripts/ to tests/
- Add tests/__init__.py to make it a proper Python package
- Update Makefile to include tests/ in linting and formatting commands

This better separates testing/debugging utilities from main scripts, improving project organization and following Python best practices.

- Replace pytest command with direct execution of test files
- Run test_recent_links.py, test_recent_download.py, and debug_latest_data.py
- All tests now work properly from the tests/ directory

✨ Features:
- Add DataPostProcessor class in src/letourdataset/postprocessor.py
- Add postprocess_data.py script for running postprocessor
- Integrate postprocessing into Makefile pipeline (runs after data download)

📊 Data Organization:
- TDFF_All_Rankings_History.csv: Sort by Year, Stages, Rank (all numeric, first 3 columns)
- TDFF_Riders_History.csv: Sort by Year, Rank (both numeric, first 2 columns)
- TDFF_Stages_History.csv: Sort by Year, Rank (both numeric, first 2 columns)
- TDF_* files: Same logic applied to men's data

🔧 Pipeline Integration:
- make update: Now includes automatic postprocessing after download
- make postprocess: Standalone command for manual postprocessing
- All data files automatically sorted and organized before any commit
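The sorting rule listed under Data Organization (numeric sort on the first two or three columns) might look like the sketch below. It is an illustration only: `sort_by_leading_columns` is a hypothetical helper, not the `DataPostProcessor` API, and the sample column names are assumed from the commit message.

```python
import pandas as pd


def sort_by_leading_columns(df: pd.DataFrame, n_keys: int) -> pd.DataFrame:
    """Sort by the first n_keys columns, compared as numbers rather than strings."""
    keys = list(df.columns[:n_keys])
    out = df.copy()
    for key in keys:
        # Values scraped as text ("10", "9") must be numeric to sort correctly;
        # unparseable entries become NaN and are placed last by sort_values.
        out[key] = pd.to_numeric(out[key], errors="coerce")
    return out.sort_values(keys).reset_index(drop=True)


riders = pd.DataFrame(
    {
        "Year": ["2024", "2022", "2024"],
        "Rank": ["10", "2", "9"],
        "Rider": ["a", "b", "c"],
    }
)
sorted_riders = sort_by_leading_columns(riders, n_keys=2)
```

Converting to numbers first matters because a plain string sort would place rank "10" before rank "9".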

- Update pyproject.toml to properly specify src/ package location
- Remove manual sys.path manipulation from all scripts and tests
- Add VS Code settings for better IDE support with src/ structure
- Add IDE folders to .gitignore for cleaner repo
- All scripts now use clean imports via poetry install

This resolves the red squiggly import errors in IDEs while maintaining full functionality of all scripts and tests.

- Remove TDF_All_Rankings_History.csv and TDFF_All_Rankings_History.csv from git
- Add *All_Rankings_History.csv to .gitignore to prevent future tracking
- Files remain locally available but won't be committed to repository
- Postprocessor continues to work with these files when present

These large files (50MB+) are better kept local to avoid repository bloat.

- Update pyproject.toml with proper src/ package configuration
- Clean up all import statements across scripts and tests
- Remove manual sys.path manipulation in favor of proper Poetry imports
- Add comprehensive .gitignore patterns for macOS and system files
- Process and sort all data files with postprocessor pipeline
- Ensure all scripts work with clean, professional import structure

This finalizes the transition to a modern Python package structure with proper dependency management and clean imports throughout.

- Remove direct numpy dependency from pyproject.toml
- Replace all numpy usage with pandas equivalents:
  * np.array() → .values
  * np.isnan() → pd.isna()
  * np.unique() → pd.Series.unique()
- Update pandas to v2.3.1 for better compatibility
- Add lxml-html-clean for requests-html compatibility
- Move jupyter/ipykernel to dev dependencies section
- All functionality tested and working correctly
- Repository now has cleaner, more focused dependencies
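The three replacements named in this commit can be tried on a small Series. This is a sketch with made-up sample values, not code from the repository. One behavioral difference is worth keeping in mind: `np.unique` returns sorted values, while `pd.Series.unique` returns values in order of first appearance, so a caller that relied on sorted output needs an explicit sort.

```python
import pandas as pd

# Made-up sample data with one missing value.
times = pd.Series([3.0, None, 1.0, 3.0])

# np.array(series) -> series.values: the underlying array of the Series.
as_array = times.values

# np.isnan(x) -> pd.isna(x): element-wise missing-value mask.
missing = pd.isna(times)

# np.unique(x) -> series.unique(): distinct values in order of first appearance.
distinct = times.dropna().unique()

# Restore np.unique-style ordering explicitly where it matters.
distinct_sorted = sorted(distinct)
```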

Copilot

Pull Request Overview

This pull request implements a comprehensive restructuring and modernization of the LeTourDataSet project to include 2025 men's and 2024 women's Tour de France data. The changes transform the project from a simple script into a well-organized Python package with automated data processing capabilities, modern tooling, and robust data protection workflows.

Key changes include:

- Complete project restructuring with modern Python packaging and organized data folders
- Addition of latest 2025 men's and 2024 women's Tour de France data with comprehensive CSV files
- Implementation of automated data processing pipeline with Makefile commands and post-processing capabilities

Reviewed Changes

Copilot reviewed 28 out of 40 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/letourdataset/scraper.py` | Complete rewrite of data scraping functionality with improved error handling and women's tour support |
| `src/letourdataset/visualizer.py` | Updated visualization component with dynamic year ranges and improved plotting capabilities |
| `data/women/TDFF_*.csv` | New comprehensive women's Tour de France datasets for 2022-2024 |
| `scripts/*.py` | New automated scripts for data downloading, processing, and visualization generation |
| `Makefile` | Comprehensive automation for data management, testing, and development workflows |
| `pyproject.toml` | Modern Python packaging configuration with updated dependencies |
| `README.md` | Complete documentation rewrite with usage examples and development guidelines |

src/letourdataset/scraper.py

Copilot · 2025-08-02T16:36:25Z

src/letourdataset/visualizer.py


    def plot(self, df: pd.DataFrame, saveas="data/TDF_Distance_And_Pace.png") -> None:
-        last_year = 2024
+        last_year = 2030


The hardcoded year 2030 should be made dynamic or configurable to avoid requiring manual updates as years progress.

Suggested change

-        last_year = 2030
+        last_year = df["Year"].max()

src/letourdataset/scraper.py

Copilot · 2025-08-02T16:36:26Z

scripts/download_data.py

+
+async def main():
+    """Download historical Tour de France data for both men's and women's races."""
+    base_folder = "../data"


Using relative paths like '../data' can be fragile and fail depending on the working directory. Consider using absolute paths or Path.resolve() for more robust file handling.

Suggested change

-    base_folder = "../data"
+    base_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "data"))
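The same idea can be written with `pathlib`, which the review comment mentions via `Path.resolve()`. This is a sketch under the assumption that the script lives one level below the repository root (as in `scripts/download_data.py`); the helper name `data_dir_for` is hypothetical.

```python
from pathlib import Path


def data_dir_for(script_path: str) -> Path:
    """Return the absolute path of the data/ folder next to the script's parent folder."""
    # resolve() makes the path absolute, so the result no longer depends
    # on the directory the script is launched from.
    return Path(script_path).resolve().parent.parent / "data"


# Inside scripts/download_data.py this would be called as:
#     base_folder = data_dir_for(__file__)
example = data_dir_for("/tmp/repo/scripts/download_data.py")
```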

Copilot · 2025-08-02T16:36:26Z

src/letourdataset/scraper.py

+                    df["GapSeconds"] = 0
+                ts = tmp["TotalSeconds"].values
+                gs = tmp["GapSeconds"].values
+                ts[1:] = ts[0] + gs[1:]


The calculation for years 2006 and 1997 uses NumPy-style array operations on pandas Series. This hardcoded logic for specific years should be documented with the reason for this special handling.
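One way to make the intent of that snippet explicit is a small, documented helper, sketched below. It reproduces only what the diff shows: every rider after the first gets the first rider's total time plus that rider's gap. The function name is hypothetical, and the reason the scraper applies this to 1997 and 2006 in particular is not stated in this PR, so the sketch does not assume one.

```python
import pandas as pd


def rebuild_total_seconds(ranking: pd.DataFrame) -> pd.DataFrame:
    """Recompute TotalSeconds as leader time + GapSeconds for all rows after the first.

    Assumes rows are ordered by rank, so row 0 is the leader and already
    carries a valid TotalSeconds value.
    """
    out = ranking.copy()
    leader_time = out["TotalSeconds"].iloc[0]
    rebuilt = leader_time + out["GapSeconds"]
    # Keep the leader's own time untouched, mirroring ts[1:] = ts[0] + gs[1:].
    rebuilt.iloc[0] = leader_time
    out["TotalSeconds"] = rebuilt
    return out


sample = pd.DataFrame({"TotalSeconds": [1000, 0, 0], "GapSeconds": [0, 30, 95]})
fixed = rebuild_total_seconds(sample)
```

Working on a copy and assigning the column back avoids mutating the array behind `.values` in place, whose behavior depends on whether pandas hands back a view or a copy.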

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

thomascamminady added 15 commits August 2, 2025 11:26

Apply manual edits and final improvements

044a57d


Rename title.

5597303

Run plots.

3ff58f1

Add plots.

91c818b

Update Makefile test command to run individual test files

2cab3b6


thomascamminady requested a review from Copilot August 2, 2025 15:31

This comment was marked as outdated.


thomascamminady requested a review from Copilot August 2, 2025 16:35

Copilot AI reviewed Aug 2, 2025


thomascamminady and others added 2 commits August 2, 2025 19:54

Update src/letourdataset/scraper.py

a64addf

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/letourdataset/scraper.py

53fe8a2

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

thomascamminady merged commit 31ead7c into master Aug 2, 2025
1 check passed

thomascamminady merged 18 commits into master from Update_data_2025_men_2024_women

2 participants
