Update data 2025 men 2024 women #16

Merged
thomascamminady merged 18 commits into master from Update_data_2025_men_2024_women
Aug 2, 2025

Conversation

@thomascamminady
Owner

No description provided.

- Reorganize repository structure with src/ layout
- Rename Downloader class to Scraper for better clarity
- Rename Plotter class to Visualizer for better clarity
- Fix websockets compatibility issue (pin to ~8.1)
- Add support for 2025 men's Tour de France data
- Add support for 2024 women's Tour de France data
- Implement domain-aware URL handling for men's vs women's sites
- Fix decimal distance parsing for accurate data extraction
- Remove obsolete Quarto website files
- Add comprehensive download script for both men's and women's data
- Implement GitHub Actions workflow to prevent data deletion
- Check that CSV files only grow (no row/column deletions)
- Verify data integrity (no modifications to existing data)
- Allow only additions of new rows and columns
- Provide detailed error messages and fix suggestions
- Include comprehensive documentation and examples
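The append-only rule described in the last few items could be sketched roughly as follows. This is an illustration only, not the workflow's actual script: the function name `check_append_only` and the sample CSV strings are assumptions, and it assumes new rows and columns are appended at the end so that existing cells keep their position.

```python
import csv
import io


def check_append_only(old_csv: str, new_csv: str) -> list[str]:
    """Return a list of violations; an empty list means the change is allowed."""
    old_rows = list(csv.reader(io.StringIO(old_csv)))
    new_rows = list(csv.reader(io.StringIO(new_csv)))
    errors = []

    if len(new_rows) < len(old_rows):
        errors.append(f"rows deleted: {len(old_rows)} -> {len(new_rows)}")

    # Compare each old row against the row at the same position in the new file.
    for i, (old, new) in enumerate(zip(old_rows, new_rows)):
        if len(new) < len(old):
            errors.append(f"row {i}: columns deleted")
        elif new[: len(old)] != old:
            errors.append(f"row {i}: existing values modified")
    return errors


# Hypothetical file contents for illustration.
old = "Year,Rank\n2024,1\n"
new_ok = "Year,Rank,Rider\n2024,1,A\n2025,1,B\n"   # one new column, one new row
new_bad = "Year,Rank\n2023,1\n"                     # an existing value was changed

print(check_append_only(old, new_ok))
print(check_append_only(old, new_bad))
```

A check like this would need a different comparison (for example, matching rows by key) if files are re-sorted between commits.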

Safety measures:
✅ Row count protection (only increases allowed)
✅ Column count protection (only increases allowed)
✅ Data integrity verification (no modifications)
✅ Automatic PR blocking on violations
✅ Clear error reporting with fix suggestions

🗂️ Data Organization:
- Move CSV files to data/men/ and data/women/ subfolders
- Move plots to data/plots/ folder
- Update all file paths and references

🛠️ Makefile Integration:
- Add comprehensive Makefile with common commands
- make update: Download data only (no plots)
- make plot: Generate plots from existing data
- make all: Run both update and plot
- make install/test/lint/format: Development workflow
- make check-csv: Validate data integrity
- make migrate-data: One-time data migration helper

📊 Separate Plot Generation:
- Create dedicated scripts/generate_plots.py
- Decouple data download from plot generation
- Support independent plot regeneration

📚 Updated Documentation:
- Comprehensive README with new structure
- Updated file paths and examples
- Added development and contribution guidelines
- Document 2025 men's and 2024 women's data coverage

🛡️ Enhanced Data Protection:
- Update CSV protection to handle subdirectories
- Support data/men/*.csv and data/women/*.csv patterns
- Maintain integrity checking across folder structure

🔧 Development Improvements:
- Update poetry.lock with latest dependencies
- Clean separation of concerns (download vs plot)
- Better developer experience with make commands
- Update documentation and scripts based on manual refinements
- Ensure all file paths and references are correctly updated
- Final polish for repository organization and tooling

🔧 Root Cause Fixed:
- The _get_urls method was reversing the list of matches ([::-1])
- This caused the scraper to process oldest data first instead of newest
- Latest years (2025 men's, 2024 women's) were available but processed last

✅ Solution Applied:
- Removed the list reversal in _get_urls method
- Now processes most recent years first
- 2025 men's data: 3323km, 21 stages, 160 riders
- 2024 women's data: 946km, 8 stages, 110 riders
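To illustrate the root cause, here is a minimal sketch of how a trailing `[::-1]` flips the processing order. The function names and the sample paths are hypothetical (the real `_get_urls` is not shown in this PR), and the sketch assumes the page lists the newest year first.

```python
def get_urls_reversed(matches: list[str]) -> list[str]:
    """Old behaviour: reverse the matches before returning them."""
    return matches[::-1]


def get_urls(matches: list[str]) -> list[str]:
    """New behaviour: keep the order in which the matches were found."""
    return list(matches)


# Hypothetical matches, assumed to be listed newest first on the page.
matches = ["/history/2025", "/history/2024", "/history/2023"]

print(get_urls_reversed(matches)[0])  # /history/2023 (oldest first)
print(get_urls(matches)[0])           # /history/2025 (newest first)
```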

🧪 Validation:
- Verified both datasets are now retrieved correctly
- Men's 2025 Tour de France (Tadej Pogačar winner) ✅
- Women's 2024 Tour de France (Katarzyna Niewiadoma-Phinney winner) ✅
- All data integrity maintained for historical years

- Add 2025 Tour de France men's data (Tadej Pogačar winner)
- Add 2024 Tour de France Femmes women's data (Katarzyna Niewiadoma winner)
- Update all historical data files with latest information
- Add debug scripts for data validation and testing
- Fix websockets compatibility issue (pinned to ~8.1)
- Remove list reversal in URL processing for proper chronological order
- Create comprehensive debugging tools for future maintenance

Data coverage:
- Men's: 1903-2025 (a 122-year span, complete)
- Women's: 2022-2024 (3 years since restart complete)

- Move test_recent_download.py from scripts/ to tests/
- Move test_recent_links.py from scripts/ to tests/
- Move debug_latest_data.py from scripts/ to tests/
- Add tests/__init__.py to make it a proper Python package
- Update Makefile to include tests/ in linting and formatting commands

This better separates testing/debugging utilities from main scripts,
improving project organization and following Python best practices.

- Replace pytest command with direct execution of test files
- Run test_recent_links.py, test_recent_download.py, and debug_latest_data.py
- All tests now work properly from the tests/ directory

✨ Features:
- Add DataPostProcessor class in src/letourdataset/postprocessor.py
- Add postprocess_data.py script for running postprocessor
- Integrate postprocessing into Makefile pipeline (runs after data download)

📊 Data Organization:
- TDFF_All_Rankings_History.csv: Sort by Year, Stages, Rank (all numeric, first 3 columns)
- TDFF_Riders_History.csv: Sort by Year, Rank (both numeric, first 2 columns)
- TDFF_Stages_History.csv: Sort by Year, Rank (both numeric, first 2 columns)
- TDF_* files: Same logic applied to men's data
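The numeric sort described here might look like the following in pandas. This is a sketch under assumptions, not the `DataPostProcessor` implementation: the sample rows are made up, and it assumes the sort columns arrive as strings and need converting before sorting.

```python
import pandas as pd

# Made-up rows; values are strings, as they might be after scraping.
df = pd.DataFrame(
    {
        "Year": ["2024", "2023", "2024"],
        "Rank": ["10", "2", "9"],
        "Rider": ["A", "B", "C"],
    }
)

sort_cols = ["Year", "Rank"]

# Convert the sort columns to numbers so that 9 sorts before 10
# (a plain string sort would put "10" before "9").
df[sort_cols] = df[sort_cols].apply(pd.to_numeric)
df = df.sort_values(sort_cols).reset_index(drop=True)

print(df["Rider"].tolist())  # ['B', 'C', 'A']
```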

🔧 Pipeline Integration:
- make update: Now includes automatic postprocessing after download
- make postprocess: Standalone command for manual postprocessing
- All data files automatically sorted and organized before any commit

- Update pyproject.toml to properly specify src/ package location
- Remove manual sys.path manipulation from all scripts and tests
- Add VS Code settings for better IDE support with src/ structure
- Add IDE folders to .gitignore for cleaner repo
- All scripts now use clean imports via poetry install

This resolves the red squiggly import errors in IDEs while maintaining
full functionality of all scripts and tests.

- Remove TDF_All_Rankings_History.csv and TDFF_All_Rankings_History.csv from git
- Add *All_Rankings_History.csv to .gitignore to prevent future tracking
- Files remain locally available but won't be committed to repository
- Postprocessor continues to work with these files when present

These large files (50MB+) are better kept local to avoid repository bloat.

- Update pyproject.toml with proper src/ package configuration
- Clean up all import statements across scripts and tests
- Remove manual sys.path manipulation in favor of proper Poetry imports
- Add comprehensive .gitignore patterns for macOS and system files
- Process and sort all data files with postprocessor pipeline
- Ensure all scripts work with clean, professional import structure

This finalizes the transition to a modern Python package structure
with proper dependency management and clean imports throughout.

thomascamminady requested a review from Copilot on August 2, 2025, 15:31

This comment was marked as outdated.

- Remove direct numpy dependency from pyproject.toml
- Replace all numpy usage with pandas equivalents:
  * np.array() → .values
  * np.isnan() → pd.isna()
  * np.unique() → pd.Series.unique()
- Update pandas to v2.3.1 for better compatibility
- Add lxml-html-clean for requests-html compatibility
- Move jupyter/ipykernel to dev dependencies section
- All functionality tested and working correctly
- Repository now has cleaner, more focused dependencies
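The replacements listed in this commit might look like this in practice. A minimal sketch with made-up data; note that `.values` still returns a NumPy array (pandas itself depends on NumPy), so only the direct import goes away. Also note that, unlike `np.unique`, `Series.unique()` keeps order of first appearance rather than sorting.

```python
import pandas as pd

s = pd.Series([3.0, None, 1.0, 3.0])  # None becomes NaN in a float Series

arr = s.values                  # instead of np.array(s)
missing = pd.isna(s)            # instead of np.isnan(s)
distinct = s.dropna().unique()  # instead of np.unique(s.dropna()), but unsorted

print(missing.tolist())   # [False, True, False, False]
print(distinct.tolist())  # [3.0, 1.0]
```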
thomascamminady requested a review from Copilot on August 2, 2025, 16:35
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request implements a comprehensive restructuring and modernization of the LeTourDataSet project to include 2025 men's and 2024 women's Tour de France data. The changes transform the project from a simple script into a well-organized Python package with automated data processing capabilities, modern tooling, and robust data protection workflows.

Key changes include:

  • Complete project restructuring with modern Python packaging and organized data folders
  • Addition of latest 2025 men's and 2024 women's Tour de France data with comprehensive CSV files
  • Implementation of automated data processing pipeline with Makefile commands and post-processing capabilities

Reviewed Changes

Copilot reviewed 28 out of 40 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/letourdataset/scraper.py | Complete rewrite of data scraping functionality with improved error handling and women's tour support |
| src/letourdataset/visualizer.py | Updated visualization component with dynamic year ranges and improved plotting capabilities |
| data/women/TDFF_*.csv | New comprehensive women's Tour de France datasets for 2022-2024 |
| scripts/*.py | New automated scripts for data downloading, processing, and visualization generation |
| Makefile | Comprehensive automation for data management, testing, and development workflows |
| pyproject.toml | Modern Python packaging configuration with updated dependencies |
| README.md | Complete documentation rewrite with usage examples and development guidelines |


def plot(self, df: pd.DataFrame, saveas="data/TDF_Distance_And_Pace.png") -> None:
-    last_year = 2024
+    last_year = 2030

Copilot AI Aug 2, 2025

The hardcoded year 2030 should be made dynamic or configurable to avoid requiring manual updates as years progress.

Suggested change
last_year = 2030
last_year = df["Year"].max()
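A minimal sketch of the suggested change, using a made-up DataFrame. The `int(...)` cast is an addition that is not part of the suggestion: `Series.max()` returns a NumPy integer, and a plain `int` may be more convenient if the value is later formatted or passed to other libraries.

```python
import pandas as pd

# Made-up frame with only the column the suggestion relies on.
df = pd.DataFrame({"Year": [1903, 2024, 2025]})

# Derive the range from the data instead of hardcoding it.
first_year = int(df["Year"].min())
last_year = int(df["Year"].max())

print(first_year, last_year)  # 1903 2025
```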


async def main():
    """Download historical Tour de France data for both men's and women's races."""
    base_folder = "../data"

Copilot AI Aug 2, 2025

Using relative paths like '../data' can be fragile and fail depending on the working directory. Consider using absolute paths or Path.resolve() for more robust file handling.

Suggested change
base_folder = "../data"
base_folder = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "data"))
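The review comment also mentions `Path.resolve()`. A sketch of that variant is below; the helper name `data_folder` and the example script path are hypothetical, and it assumes the script lives one level below the repository root (for example in scripts/).

```python
from pathlib import Path


def data_folder(script_file: str) -> Path:
    """Return <repo>/data for a script located at <repo>/<subdir>/<script>."""
    return Path(script_file).resolve().parent.parent / "data"


# In a real script this would be called as data_folder(__file__).
example = data_folder("/repo/scripts/download.py")
print(example)
```

Because the result is anchored to the script's own location, it does not depend on the directory the script is launched from.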

df["GapSeconds"] = 0
ts = tmp["TotalSeconds"].values
gs = tmp["GapSeconds"].values
ts[1:] = ts[0] + gs[1:]

Copilot AI Aug 2, 2025

The calculation for years 2006 and 1997 uses NumPy-style array operations on pandas Series. This hardcoded logic for specific years should be documented with the reason for this special handling.
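For readers unfamiliar with the slice assignment in question, here is what `ts[1:] = ts[0] + gs[1:]` computes, written with plain Python lists. The numbers are made up, and the sketch assumes rows are ordered by rank so that index 0 is the winner; why 2006 and 1997 need this treatment is not stated in the PR.

```python
# Made-up values: only the winner's total time is known up front,
# while every other rider has a gap to the winner.
total_seconds = [300_000, 0, 0]
gap_seconds = [0, 45, 130]

# Equivalent of ts[1:] = ts[0] + gs[1:] without NumPy broadcasting:
# each rider's total is the winner's total plus that rider's gap.
winner_total = total_seconds[0]
total_seconds[1:] = [winner_total + gap for gap in gap_seconds[1:]]

print(total_seconds)  # [300000, 300045, 300130]
```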

thomascamminady and others added 2 commits August 2, 2025 19:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
thomascamminady merged commit 31ead7c into master on Aug 2, 2025
1 check passed

2 participants