
Conversation


@JoeXic JoeXic commented Nov 2, 2025

Describe this PR

Add real-time web monitoring dashboard for GAIA validation benchmark with progress tracking and visualization capabilities.

What changed?

  • Added run_gaia_with_monitor.py to run GAIA benchmark with integrated web monitoring
  • Added utils/progress_check/gaia_web_monitor.py - web dashboard for real-time progress tracking
  • Added utils/progress_check/generate_gaia_report.py - report generation utility
  • Updated main.py to support the new monitoring command
  • Web dashboard accessible at http://localhost:8080 during benchmark execution
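
The dashboard described above could be served with nothing more than the Python standard library. The following is a minimal sketch, not the actual `gaia_web_monitor.py` implementation; the `progress` fields and `start_monitor` helper are assumptions for illustration:

```python
# Minimal sketch of a progress dashboard on http://localhost:8080.
# The real gaia_web_monitor.py may differ; field names are assumed.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Shared progress state, updated by the benchmark runner as tasks finish.
progress = {"completed": 0, "total": 0, "correct": 0}

class ProgressHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current progress counters as JSON for any path.
        body = json.dumps(progress).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging so benchmark logs stay readable.
        pass

def start_monitor(port=8080):
    # Run the HTTP server on a daemon thread so it dies with the runner.
    server = HTTPServer(("localhost", port), ProgressHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A frontend polling this endpoint can then render completion status without the user tailing log files.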

Why?

Long benchmarks like GAIA validation take hours to run, and users need a way to:

  • Monitor real-time progress without constantly checking logs
  • Visualize task completion status
  • Track performance metrics during execution
  • Generate comprehensive reports after completion

- Add run-gaia-with-monitor command for running benchmark with real-time monitoring
- Add web dashboard for monitoring benchmark progress (gaia_web_monitor.py)
- Add generate_gaia_report.py to utils/progress_check/ for generating task reports
@JoeXic JoeXic closed this Nov 2, 2025
@JoeXic JoeXic reopened this Nov 2, 2025
@JoeXic JoeXic changed the title feat(monitoring): add real-time web dashboard for GAIA benchmark progress feat(monitoring): add real-time web dashboard for monitoring benchmark progress Nov 10, 2025

JoeXic commented Nov 10, 2025

Describe this PR

Refactor the monitoring system from GAIA-specific to generic benchmark monitoring, supporting the GAIA, FutureX, xbench, and FinSearchComp benchmarks with a real-time web dashboard.

What changed?

Core Changes

  • Replaced run_gaia_with_monitor.py → run_benchmark_with_monitor.py (generic benchmark runner)
  • Replaced utils/progress_check/gaia_web_monitor.py → utils/progress_check/benchmark_monitor.py (generic monitor)
  • Replaced utils/progress_check/generate_gaia_report.py → utils/progress_check/generate_benchmark_report.py (generic report generator)
  • Updated main.py to use the new generic monitoring system
  • Updated utils/progress_check/check_finsearchcomp_progress.py (fixed type annotation)

New Features

  • Auto-detect benchmark type from log folder path
  • Support benchmark-specific metrics:
    • GAIA/FinSearchComp: Correctness evaluation (accuracy)
    • FutureX/xbench: Prediction tracking (prediction rate)
    • FinSearchComp: Task type breakdown (T1/T2/T3) and regional analysis
  • Extract attempt number from log filename for accurate report generation
  • Suppress verbose HTTP logs in web dashboard
  • Automatic port conflict resolution

Documentation

  • Added monitor_guide.md - Web monitoring dashboard guide

Why?

Long benchmark runs (GAIA, FutureX, xbench, FinSearchComp) take hours, and users need a way to:

  • Monitor real-time progress without constantly checking logs
  • Visualize task completion status with benchmark-specific metrics
  • Track performance metrics during execution (accuracy for GAIA, prediction rate for FutureX/xbench)
  • Generate comprehensive reports after completion
  • Use a unified monitoring system across all benchmarks instead of benchmark-specific solutions
