Merged
2 changes: 1 addition & 1 deletion LICENSE
Original file line number Diff line number Diff line change
@@ -186,7 +186,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]
Copyright 2025 Jeremy Vachier

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
22 changes: 11 additions & 11 deletions Makefile
@@ -25,53 +25,53 @@ help:

# Dependency management
install:
@echo "📦 Installing dependencies with uv..."
@echo "Installing dependencies with uv..."
uv sync --all-extras

# Code quality with Ruff
format:
@echo "🎨 Formatting code with ruff..."
@echo "Formatting code with ruff..."
uv run ruff format src/ dash_app/ tests/ scripts/

lint:
@echo "🔍 Linting code with ruff..."
@echo "Linting code with ruff..."
uv run ruff check . --fix
uv run ruff format --check .

# Type checking
typecheck:
@echo "🔎 Type checking with mypy..."
@echo "Type checking with mypy..."
uv run mypy src/ --ignore-missing-imports

# Security checking
security:
@echo "🔒 Security checking with bandit..."
@echo "Security checking with bandit..."
uv run bandit -r src/ -f json

# Run all quality checks
check-all: lint typecheck security
@echo "All code quality checks completed!"
@echo "All code quality checks completed!"

# Testing
test:
@echo "🧪 Running tests..."
@echo "Running tests..."
uv run pytest tests/ -v

# Pipeline execution
run:
@echo "🚀 Running modular pipeline..."
@echo "Running modular pipeline..."
uv run python src/main_modular.py

# Model training
train-models:
@echo "🤖 Training and saving ML models..."
@echo "Training and saving ML models..."
uv run python scripts/train_and_save_models.py

# Dash application
dash:
@echo "📊 Starting Dash application..."
@echo "Starting Dash application..."
uv run python dash_app/main.py --model-name ensemble

stop-dash:
@echo "🛑 Stopping Dash application..."
@echo "Stopping Dash application..."
@lsof -ti:8050 | xargs kill -9 2>/dev/null || echo "No process found on port 8050"
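The `stop-dash` target above shells out to `lsof -ti:8050 | xargs kill -9` to stop whatever listens on the Dash port. As a minimal sketch of the underlying idea — detecting whether anything is listening on a TCP port — the following uses only the standard library; the function name and the localhost-only scope are illustrative assumptions, not part of the repository:

```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on success instead of raising
        return s.connect_ex((host, port)) == 0

if __name__ == "__main__":
    # Bind a throwaway server socket to demonstrate detection.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))   # port 0: the OS picks a free port
    srv.listen(1)
    port = srv.getsockname()[1]
    print(port_in_use(port))     # True while the server socket is listening
    srv.close()
```

Unlike `kill -9` via `lsof`, this only detects the listener; it is shown here to make the Makefile's port check concrete, not as a replacement for it.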
42 changes: 21 additions & 21 deletions README.md
Expand Up @@ -63,18 +63,18 @@ uv run python src/main_modular.py # Run pipeline

```
src/
├── main_modular.py # 🎯 Main production pipeline (MLOps-enhanced)
├── modules/ # 🧩 Core modules
│ ├── config.py # ⚙️ Configuration & logging
│ ├── data_loader.py # 📊 Data loading & external merge
│ ├── preprocessing.py # 🔧 Feature engineering
│ ├── data_augmentation.py # 🎲 Advanced synthetic data
│ ├── model_builders.py # 🏭 Model stack construction
│ ├── ensemble.py # 🎯 Ensemble & OOF predictions
│ ├── optimization.py # 🔍 Optuna utilities
│ └── utils.py # 🛠️ Utility functions

dash_app/ # 🖥️ Interactive Dashboard
├── main_modular.py # Main production pipeline (MLOps-enhanced)
├── modules/ # Core modules
│ ├── config.py # Configuration & logging
│ ├── data_loader.py # Data loading & external merge
│ ├── preprocessing.py # Feature engineering
│ ├── data_augmentation.py # Advanced synthetic data
│ ├── model_builders.py # Model stack construction
│ ├── ensemble.py # Ensemble & OOF predictions
│ ├── optimization.py # Optuna utilities
│ └── utils.py # Utility functions

dash_app/ # Interactive Dashboard
├── dashboard/ # Application source
│ ├── app.py # Main Dash application
│ ├── layout.py # UI layout components
@@ -84,21 +84,21 @@ dash_app/ # 🖥️ Interactive Dashboard
├── Dockerfile # Container configuration
└── docker-compose.yml # Multi-service orchestration

models/ # 🤖 Trained Models
models/ # Trained Models
├── ensemble_model.pkl # Production ensemble model
├── ensemble_metadata.json # Model metadata and labels
├── stack_*_model.pkl # Individual stack models
└── stack_*_metadata.json # Stack-specific metadata

scripts/ # 🛠️ Utility Scripts
scripts/ # Utility Scripts
└── train_and_save_models.py # Model training and persistence

data/ # 📊 Datasets
data/ # Datasets

docs/ # 📝 Documentation
docs/ # Documentation
└── [Generated documentation] # Technical guides

best_params/ # 💾 Optimized parameters
best_params/ # Optimized parameters
└── stack_*_best_params.json # Per-stack best parameters
```

@@ -231,19 +231,19 @@ The pipeline employs six specialized ensemble stacks, each optimized for differe
The pipeline is designed to achieve high accuracy through ensemble learning and advanced optimization techniques. Performance will vary based on:

```
📊 Dataset Statistics
Dataset Statistics
├── Training Samples: ~18,000+ (with augmentation)
├── Test Samples: ~6,000+
├── Original Features: 8 personality dimensions
├── Engineered Features: 14+ (with preprocessing)
├── Augmented Samples: Variable (adaptive, typically 5-10%)
└── Class Balance: Extrovert/Introvert classification

🔧 Technical Specifications
Technical Specifications
├── Memory Usage: <4GB peak (configurable)
├── CPU Utilization: 4 cores (configurable)
├── Model Persistence: Best parameters saved
└── Reproducibility: Fixed random seeds
├── Model Persistence: Yes - Best parameters saved
└── Reproducibility: Yes - Fixed random seeds
```
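The six-stack ensemble summarized above combines per-stack probabilities with optimized blending weights. A minimal sketch of such a normalized weighted blend — the function name and array shapes are illustrative assumptions, not the repository's API — might look like:

```python
import numpy as np

def blend(probs: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    """Weighted average of per-stack class-1 probabilities, weights normalized."""
    total = sum(weights[k] for k in probs)
    return sum(weights[k] * probs[k] for k in probs) / total

# Two test samples, six identical stacks: the blend reduces to each stack's output.
probs = {k: np.array([0.2, 0.8]) for k in "ABCDEF"}
weights = {k: 1.0 for k in "ABCDEF"}
print(blend(probs, weights))  # [0.2 0.8]
```

In the actual pipeline the weights are searched by Optuna against out-of-fold predictions; the sketch only shows the final combination step.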

## Testing & Validation
2 changes: 1 addition & 1 deletion docs/README.md
@@ -40,7 +40,7 @@ docker build -t personality-classifier .
docker run -p 8080:8080 personality-classifier
```

## 📚 Resources
## Resources

- Code: `src/main_modular.py`, `examples/`
- Config templates: [Configuration Guide](configuration.md)
4 changes: 2 additions & 2 deletions docs/data-augmentation.md
@@ -397,14 +397,14 @@ def calculate_adaptive_ratio(data_characteristics):

### When to Use Augmentation

**Recommended**:
**Recommended**:

- Small to medium datasets (<10K samples)
- Class imbalanced problems
- High-stakes applications requiring robustness
- When overfitting is detected

**Not Recommended**:
**Not Recommended**:

- Very large datasets (>100K samples)
- When computational resources are limited
6 changes: 3 additions & 3 deletions scripts/train_and_save_models.py
@@ -174,14 +174,14 @@ def main():
setup_logging()
logger = get_logger(__name__)

logger.info("🚀 Starting model training and saving process...")
logger.info("Starting model training and saving process...")

# Create models directory
models_dir = Path("models")
models_dir.mkdir(exist_ok=True)

# Load and prepare data
logger.info("📊 Loading and preparing data...")
logger.info("Loading and preparing data...")
df_tr, df_te, submission = load_data_with_external_merge()

# Preprocess data (prep function expects target column in df_tr)
@@ -215,7 +215,7 @@ def main():
except Exception as e:
logger.error(f"Failed to train ensemble model: {e}")

logger.info("Model training and saving complete!")
logger.info("Model training and saving complete!")
logger.info(f"Models saved in: {models_dir.absolute()}")


48 changes: 23 additions & 25 deletions src/main_modular.py
Expand Up @@ -88,7 +88,7 @@ def load_and_prepare_data(
testing_mode: bool = True, test_size: int = 1000
) -> TrainingData:
"""Load and prepare training data."""
logger.info("🎯 Six-Stack Personality Classification Pipeline (Modular)")
logger.info("Six-Stack Personality Classification Pipeline (Modular)")
logger.info("=" * 60)

# Load data using advanced merge strategy
@@ -97,11 +97,11 @@
# FOR TESTING: Limit to specified samples for faster execution
if testing_mode and len(df_tr) > test_size:
logger.info(
f"🔬 TESTING MODE: Limiting dataset to {test_size} samples "
f"TESTING MODE: Limiting dataset to {test_size} samples "
f"(original: {len(df_tr)})"
)
df_tr = df_tr.sample(n=test_size, random_state=RND).reset_index(drop=True)
logger.info(f" 📊 Using {len(df_tr)} samples for testing")
logger.info(f" Using {len(df_tr)} samples for testing")

# Preprocess data with advanced competitive approach (do this first)
X_full, X_test, y_full, le = prep(df_tr, df_te)
@@ -235,7 +235,7 @@ def train_single_stack(config: StackConfig, data: TrainingData) -> optuna.Study:

def train_all_stacks(data: TrainingData) -> dict[str, optuna.Study]:
"""Train all stacks in the ensemble."""
logger.info("\n🔍 Training 6 specialized stacks...")
logger.info("\nTraining 6 specialized stacks...")

stack_configs = get_stack_configurations()
studies = {}
@@ -250,7 +250,7 @@ def create_model_builders(
studies: dict[str, optuna.Study], data: TrainingData
) -> dict[str, Callable[[], Any]]:
"""Create model builder functions for each stack."""
logger.info("\n📊 Creating model builders for ensemble...")
logger.info("\nCreating model builders for ensemble...")

builders = {
"A": lambda: build_stack(studies["A"].best_trial, seed=RND, wide_hp=False),
Expand All @@ -274,7 +274,7 @@ def generate_oof_predictions(
builders: dict[str, Callable[[], Any]], data: TrainingData
) -> dict[str, pd.Series]:
"""Generate out-of-fold predictions for all stacks."""
logger.info("\n🔮 Generating out-of-fold predictions...")
logger.info("\nGenerating out-of-fold predictions...")

oof_predictions = {}

@@ -325,7 +325,7 @@ def optimize_ensemble_blending(
oof_predictions: dict[str, pd.Series], y_full: pd.Series
) -> tuple[dict[str, float], float]:
"""Optimize ensemble blending weights."""
logger.info("\n⚖️ Optimizing ensemble blending...")
logger.info("\nOptimizing ensemble blending...")

study_blend = optuna.create_study(direction="maximize")
blend_objective = create_blend_objective(oof_predictions, y_full)
@@ -345,7 +345,7 @@
"F": best_weights_list[5],
}

logger.info("\n🏆 Best ensemble weights:")
logger.info("\nBest ensemble weights:")
for stack_name, weight in best_weights.items():
logger.info(f" Stack {stack_name}: {weight:.3f}")
logger.info(f"Best CV score: {study_blend.best_value:.6f}")
@@ -377,7 +377,7 @@ def refit_and_predict(
models["F"].fit(data.X_full, y_full_noisy)

# Generate final predictions
logger.info("\n🎯 Generating final predictions...")
logger.info("\nGenerating final predictions...")
probabilities = {}
for stack_name in ["A", "B", "C", "D", "E", "F"]:
probabilities[stack_name] = models[stack_name].predict_proba(data.X_test)[:, 1]
@@ -408,11 +408,11 @@ def apply_pseudo_labelling(
) -> TrainingData:
"""Apply pseudo labelling using ensemble predictions."""
if not ENABLE_PSEUDO_LABELLING:
logger.info("🔮 Pseudo labelling disabled")
logger.info("Pseudo labelling disabled")
return data

logger.info(
f"\n🔮 Applying pseudo labelling (threshold={PSEUDO_CONFIDENCE_THRESHOLD}, max_ratio={PSEUDO_MAX_RATIO})..."
f"\nApplying pseudo labelling (threshold={PSEUDO_CONFIDENCE_THRESHOLD}, max_ratio={PSEUDO_MAX_RATIO})..."
)

# First train models to get test predictions for pseudo labelling
@@ -464,9 +464,7 @@

# Create new TrainingData with pseudo labels added
if pseudo_stats["n_pseudo_added"] > 0:
logger.info(
f"✅ Pseudo labelling added {pseudo_stats['n_pseudo_added']} samples"
)
logger.info(f"Pseudo labelling added {pseudo_stats['n_pseudo_added']} samples")

# Create new TrainingData object with enhanced training set
enhanced_data = TrainingData(
@@ -478,14 +476,14 @@
)
return enhanced_data
else:
logger.info("⚠️ No pseudo labels added, using original data")
logger.info("No pseudo labels added, using original data")
return data


def main():
"""Main execution function for the Six-Stack Personality Classification Pipeline."""

logger.info("🚀 Starting Six-Stack Personality Classification Pipeline")
logger.info("Starting Six-Stack Personality Classification Pipeline")

try:
# Load and prepare data
@@ -494,7 +492,7 @@
)

logger.info(
f"📊 Loaded data: {len(data.X_full)} training samples, {len(data.X_test)} test samples"
f"Loaded data: {len(data.X_full)} training samples, {len(data.X_test)} test samples"
)

# Train all stacks
Expand All @@ -503,7 +501,7 @@ def main():
# Log stack optimization results
for stack_name, study in studies.items():
logger.info(
f"📈 Stack {stack_name}: Best score = {study.best_value:.6f} ({len(study.trials)} trials)"
f"Stack {stack_name}: Best score = {study.best_value:.6f} ({len(study.trials)} trials)"
)

# Create model builders
@@ -517,8 +515,8 @@
oof_predictions, data.y_full
)

logger.info(f"🎯 Best ensemble CV score: {best_cv_score:.6f}")
logger.info(f"⚖️ Ensemble weights: {best_weights}")
logger.info(f"Best ensemble CV score: {best_cv_score:.6f}")
logger.info(f"Ensemble weights: {best_weights}")

# Apply pseudo labelling using ensemble predictions
enhanced_data = apply_pseudo_labelling(builders, best_weights, data)
@@ -534,12 +532,12 @@
)

# Print final results
logger.info(f"\n✅ Predictions saved to '{output_file}'")
logger.info(f"📊 Final submission shape: {submission_df.shape}")
logger.info("🎉 Six-stack ensemble pipeline completed successfully!")
logger.info(f"\nPredictions saved to '{output_file}'")
logger.info(f"Final submission shape: {submission_df.shape}")
logger.info("Six-stack ensemble pipeline completed successfully!")

# Print summary
logger.info("\n📋 Summary:")
logger.info("\nSummary:")
logger.info(f" - Training samples: {len(enhanced_data.X_full):,}")
logger.info(f" - Test samples: {len(enhanced_data.X_test):,}")
logger.info(f" - Features: {enhanced_data.X_full.shape[1]}")
@@ -551,7 +549,7 @@
logger.info(" - Modular architecture")

except Exception as e:
logger.error(f"Pipeline failed: {e}")
logger.error(f"Pipeline failed: {e}")
raise

