An email scam detection system built with PyTorch and Flask. It combines a bidirectional LSTM with an attention mechanism and hand-engineered features to identify phishing and scam emails.
- Overview
- Features
- Architecture
- Installation
- Usage
- Project Structure
- Model Details
- Feature Engineering
- Training Process
- Web Interface
- Performance
- Technologies Used
- Future Improvements
- Contributing
- License
This email scam detection system uses deep learning to help protect users from phishing attacks, fraudulent emails, and scam attempts. It analyzes email content using a combination of:
- Deep Learning: Bidirectional LSTM with attention mechanism
- Feature Engineering: Custom scam keyword detection, URL analysis, urgency detection, and personal information request identification
- Smart Thresholding: Optimized classification threshold for balanced accuracy
- User-Friendly Interface: Clean web UI showing classification results and confidence scores
- Advanced Neural Network Architecture
  - Bidirectional LSTM for context understanding
  - Attention mechanism for focusing on important email parts
  - 300-token sequence length for comprehensive analysis
  - 15,000-word vocabulary
- Multi-Layer Detection
  - Deep learning model predictions
  - Scam keyword detection with weighted scoring
  - Suspicious URL pattern recognition
  - Urgency tactic identification (ALL CAPS, excessive punctuation)
  - Personal information request detection
- Robust Features
  - Class weighting for imbalanced datasets
  - Stratified train/validation/test splits
  - Early stopping and learning rate scheduling
  - Dropout and L2 regularization to prevent overfitting
  - GPU acceleration support
- Web Interface
  - Simple, intuitive design
  - Real-time email analysis
  - Confidence score display
  - Adjustable detection threshold
┌─────────────────────────────────────────────────────────┐
│  User Interface (Flask)                                 │
│  templates/index_improved.html                          │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  Flask Application (app_improved.py)                    │
│  - Request handling                                     │
│  - Model loading                                        │
│  - Response formatting                                  │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│  Detection Engine (detect_scam_improved.py)             │
│  ┌───────────────────────────────────────────────────┐  │
│  │  Text Preprocessing                               │  │
│  │  - Clean HTML/special chars                       │  │
│  │  - Normalize whitespace                           │  │
│  │  - Tokenization                                   │  │
│  └─────────────────────────┬─────────────────────────┘  │
│                            ▼                            │
│  ┌───────────────────────────────────────────────────┐  │
│  │  Feature Extraction                               │  │
│  │  - Scam keywords (weighted)                       │  │
│  │  - Suspicious URLs                                │  │
│  │  - Urgency indicators                             │  │
│  │  - Personal info requests                         │  │
│  └─────────────────────────┬─────────────────────────┘  │
│                            ▼                            │
│  ┌───────────────────────────────────────────────────┐  │
│  │  Neural Network Prediction                        │  │
│  │  - Bidirectional LSTM                             │  │
│  │  - Attention mechanism                            │  │
│  │  - Dense layers with dropout                      │  │
│  └─────────────────────────┬─────────────────────────┘  │
│                            ▼                            │
│  ┌───────────────────────────────────────────────────┐  │
│  │  Final Classification                             │  │
│  │  - Combine model + features                       │  │
│  │  - Apply threshold                                │  │
│  │  - Calculate confidence                           │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
Input (300 tokens)
        │
        ▼
Embedding Layer (128 dimensions)
        │
        ▼
Bidirectional LSTM (128 units × 2)
        │
        ▼
Attention Mechanism
        │
        ▼
Dense Layer (64 units) + ReLU + Dropout(0.5)
        │
        ▼
Dense Layer (32 units) + ReLU + Dropout(0.3)
        │
        ▼
Output Layer (1 unit) + Sigmoid
        │
        ▼
Probability (0.0 - 1.0)
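The layer stack above could be expressed in PyTorch roughly as follows. This is a minimal sketch, not the project's actual code: the class name `ScamClassifier`, the use of index 0 for padding, and the single-linear-layer attention scoring are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ScamClassifier(nn.Module):
    """Hypothetical BiLSTM + attention classifier mirroring the diagram above."""

    def __init__(self, vocab_size=15000, embed_dim=128, hidden=128):
        super().__init__()
        # padding_idx=0 is an assumed convention, not confirmed by the project.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)  # one relevance score per time step
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, 1),
        )

    def forward(self, token_ids):
        x = self.embedding(token_ids)                    # (batch, seq, 128)
        out, _ = self.lstm(x)                            # (batch, seq, 256)
        weights = torch.softmax(self.attn(out), dim=1)   # (batch, seq, 1)
        context = (weights * out).sum(dim=1)             # (batch, 256)
        return torch.sigmoid(self.head(context)).squeeze(-1)  # (batch,)


model = ScamClassifier().eval()
batch = torch.randint(0, 15000, (4, 300))  # 4 random "emails" of 300 token ids
with torch.no_grad():
    probs = model(batch)
```

The training section names Binary Cross-Entropy with Logits as the loss. If that refers to PyTorch's `BCEWithLogitsLoss`, the final sigmoid would be skipped during training and applied only at inference, since that loss applies the sigmoid internally.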
- Python 3.8 or higher
- pip package manager
- (Optional) CUDA-capable GPU for faster training
- Clone or download the project
    cd "c:\Users\JHASHANK\Desktop\JHASHANK\ACHARYA\SEMESTER 5\MINI PROJECT\model_4"
- Install dependencies
    pip install -r requirements.txt
- Verify installation
    python gpu_test.py              # Check if GPU is available
    python test_compatibility.py    # Test PyTorch compatibility
torch>=2.0.0
pandas
numpy
scikit-learn
flask
Train the model on the Enron fraud dataset:
    python train_improved_model_pytorch.py
Training outputs:
- email_classification_model_pytorch.pth: Trained model weights
- tokenizer_pytorch.pkl: Fitted tokenizer
- vocab_pytorch.json: Vocabulary mapping
- optimal_threshold.txt: Best classification threshold
Training features:
- Stratified train/validation/test split (70%/15%/15%)
- Class weighting for imbalanced data
- Early stopping (patience: 5 epochs)
- Learning rate reduction on plateau
- Best model checkpoint saving
Start the Flask web server:
    python app_improved.py
Access the application at: http://localhost:5000
Test individual emails from the command line:
    python detect_scam_improved.py
Interactive mode:
- Enter email text when prompted
- Receive classification and confidence score
- Type 'quit' to exit
model_4/
│
├── app_improved.py                         # Flask web application
├── detect_scam_improved.py                 # Detection engine with feature extraction
├── train_improved_model_pytorch.py         # Model training script
│
├── templates/
│   └── index_improved.html                 # Web interface
│
├── enron_data_fraud_labeled.csv            # Training dataset (Enron emails)
│
├── email_classification_model_pytorch.pth  # Trained model (generated)
├── tokenizer_pytorch.pkl                   # Tokenizer (generated)
├── vocab_pytorch.json                      # Vocabulary (generated)
├── optimal_threshold.txt                   # Optimal threshold (generated)
│
├── requirements.txt                        # Python dependencies
├── gpu_test.py                             # GPU availability test
├── test_compatibility.py                   # PyTorch compatibility test
│
└── README.md                               # This file
| Parameter | Value | Description |
|---|---|---|
| Max Sequence Length | 300 | Maximum tokens per email |
| Vocabulary Size | 15,000 | Unique words in vocabulary |
| Embedding Dimension | 128 | Word embedding size |
| LSTM Units | 128 | Hidden units per direction |
| Batch Size | 64 | Training batch size |
| Learning Rate | 0.001 | Initial learning rate |
| Epochs | 20 | Maximum training epochs |
| Dropout Rate | 0.5 / 0.3 | Regularization dropout |
- Embedding Layer
  - Converts word indices to dense vectors
  - Dimension: 128
  - Trainable
- Bidirectional LSTM
  - Processes sequences forward and backward
  - Hidden units: 128 per direction (256 total)
  - Captures long-term dependencies
- Attention Mechanism
  - Focuses on important parts of the email
  - Computes weighted sum of LSTM outputs
  - Improves interpretability
- Dense Layers
  - Layer 1: 64 units with ReLU + 50% dropout
  - Layer 2: 32 units with ReLU + 30% dropout
  - Output: 1 unit with sigmoid activation
- Regularization
  - Dropout layers to prevent overfitting
  - L2 weight decay (0.0001)
  - Early stopping
Weighted keyword scoring system:
SCAM_KEYWORDS = {
    'urgent': 3.0,
    'verify': 2.5,
    'suspended': 3.0,
    'credit card': 3.0,
    'social security': 3.5,
    'ssn': 3.5,
    'winner': 3.0,
    'prize': 3.0,
    'irs': 3.0,
    'fbi': 3.5,
    # ... and more
}
Score calculation: Sum of (keyword weight × occurrences) / email length
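As a rough illustration of the score calculation, the sketch below sums weight times occurrences and divides by the email length. It assumes that "email length" means the word count and that keywords match on word boundaries; the function name `keyword_score` is hypothetical and only a subset of the weights is reproduced.

```python
import re

# Subset of the weights listed above; the full table is not reproduced here.
SCAM_KEYWORDS = {"urgent": 3.0, "verify": 2.5, "prize": 3.0, "credit card": 3.0}


def keyword_score(text, keywords=SCAM_KEYWORDS):
    """Sum of weight * occurrences, divided by word count (assumed length unit)."""
    lowered = text.lower()
    n_words = len(lowered.split())
    if n_words == 0:
        return 0.0
    total = 0.0
    for phrase, weight in keywords.items():
        # Word boundaries keep a short keyword from matching inside longer words.
        hits = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        total += weight * hits
    return total / n_words


score = keyword_score("URGENT: verify your credit card to claim your prize")
```

Dividing by length keeps long emails from scoring high merely because they contain more words, but the result is not bounded to the range 0 to 1, so some clipping or scaling would be needed before it is combined with other scores.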
Detects suspicious URLs including:
- Non-trusted domains
- URL shorteners (bit.ly, tinyurl, etc.)
- Typosquatting (paypa1, amaz0n, etc.)
- Suspicious TLDs (.xyz, .top, .tk, etc.)
- IP addresses in URLs
- Multiple dots/dashes (obfuscation)
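A few of these checks can be sketched with the standard library alone. The shortener set, the TLD tuple, the dot and dash cutoffs, and the function name `url_flags` are illustrative assumptions; the trusted-domain and typosquatting checks are omitted because their reference lists are not given here.

```python
import re
from urllib.parse import urlparse

SHORTENERS = {"bit.ly", "tinyurl.com"}       # illustrative, not exhaustive
SUSPICIOUS_TLDS = (".xyz", ".top", ".tk")
IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_flags(url):
    """Return the names of the heuristics that fire for a single URL."""
    # hostname is None when the URL has no scheme, so fall back to "".
    host = (urlparse(url).hostname or "").lower()
    flags = []
    if host in SHORTENERS:
        flags.append("shortener")
    if IP_HOST.match(host):
        flags.append("ip_address")
    if host.endswith(SUSPICIOUS_TLDS):
        flags.append("suspicious_tld")
    # Assumed cutoffs for "multiple dots/dashes".
    if host.count(".") >= 4 or host.count("-") >= 3:
        flags.append("obfuscation")
    return flags
```

For example, `url_flags("http://bit.ly/abc")` returns `["shortener"]`, while an ordinary address such as `https://www.example.com/` returns an empty list.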
Identifies pressure tactics:
- ALL CAPS TEXT (>30% of text)
- Excessive punctuation (!!!, ???)
- Urgency phrases ("act now", "limited time")
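The three indicators could be computed along these lines. This is a sketch under assumptions: the capital-letter ratio is measured over alphabetic characters only, "excessive punctuation" is taken to mean three or more consecutive `!` or `?`, and the phrase list contains only the two examples quoted above.

```python
import re

URGENCY_PHRASES = ("act now", "limited time")  # the two phrases quoted above


def urgency_signals(text):
    """Return one boolean per urgency indicator."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    lowered = text.lower()
    return {
        "all_caps": caps_ratio > 0.30,
        "excess_punct": bool(re.search(r"[!?]{3,}", text)),
        "urgency_phrase": any(p in lowered for p in URGENCY_PHRASES),
    }


signals = urgency_signals("ACT NOW!!! LIMITED TIME offer")
```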
Flags requests for:
- Passwords
- Credit card numbers
- Social Security Numbers (SSN)
- Bank account details
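One simple way to flag such requests is a regular expression per category. The patterns and the function name `personal_info_requests` below are hypothetical; the project's actual matching rules may be broader.

```python
import re

# Hypothetical patterns, one per category in the list above.
PII_PATTERNS = {
    "password": r"\bpasswords?\b",
    "credit_card": r"\bcredit\s+card\b",
    "ssn": r"\b(ssn|social\s+security)\b",
    "bank_account": r"\bbank\s+account\b",
}


def personal_info_requests(text):
    """Return the sorted category names whose pattern appears in the text."""
    lowered = text.lower()
    return sorted(name for name, pat in PII_PATTERNS.items() if re.search(pat, lowered))


found = personal_info_requests("Please confirm your password and bank account number")
```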
final_score = (
    0.50 * model_prediction +
    0.20 * keyword_score +
    0.15 * url_score +
    0.10 * urgency_score +
    0.05 * personal_info_score
)
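A small worked example of the weighted combination follows. It assumes every component score has already been scaled to the range 0 to 1 and that a score at or above the threshold is labeled UNSAFE; the function name `combine_scores` is hypothetical.

```python
# Weights taken from the formula above; they sum to 1.0.
WEIGHTS = {
    "model": 0.50,
    "keyword": 0.20,
    "url": 0.15,
    "urgency": 0.10,
    "personal_info": 0.05,
}


def combine_scores(scores, threshold=0.5):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    final = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return final, ("UNSAFE" if final >= threshold else "SAFE")


final, label = combine_scores(
    {"model": 0.9, "keyword": 0.5, "url": 1.0, "urgency": 0.0, "personal_info": 1.0}
)
# 0.45 + 0.10 + 0.15 + 0.00 + 0.05 = 0.75, which is above the 0.5 threshold.
```

Because the model prediction carries half the weight, the hand-engineered features can shift a borderline prediction but cannot override a confident one on their own.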
- Load Enron Dataset
  - CSV format with email text and fraud labels
  - Combines Subject + Body for full context
- Text Preprocessing
  - Remove HTML tags, URLs, email addresses
  - Normalize whitespace and special characters
  - Convert to lowercase
- Tokenization
  - Build vocabulary from training data
  - Convert text to sequences of integers
  - Pad/truncate to fixed length (300)
- Class Balancing
  - Compute class weights for imbalanced data
  - Apply weights during training
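The preprocessing, tokenization, and class-balancing steps above can be sketched in plain Python. The helper names, the reserved indices for padding and unknown words, and the exact regular expressions are assumptions; the class-weight formula is the "balanced" heuristic also used by scikit-learn.

```python
import html
import re
from collections import Counter

MAX_LEN = 300        # sequence length from the hyperparameter table
PAD, UNK = 0, 1      # reserved indices (assumed convention)


def clean(text):
    """Strip markup, URLs, and addresses, then normalize and lowercase."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)                 # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)          # special characters
    return re.sub(r"\s+", " ", text).strip().lower()


def build_vocab(texts, max_words=15000):
    """Map the most frequent training words to integer ids starting at 2."""
    counts = Counter(w for t in texts for w in clean(t).split())
    most_common = counts.most_common(max_words - 2)
    return {w: i + 2 for i, (w, _) in enumerate(most_common)}


def encode(text, vocab, max_len=MAX_LEN):
    """Convert text to ids, truncating or right-padding to max_len."""
    ids = [vocab.get(w, UNK) for w in clean(text).split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def class_weights(labels):
    """Balanced weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


vocab = build_vocab(["win a prize", "win now"])
ids = encode("win big", vocab)
weights = class_weights([0, 0, 0, 1])
```

Since this cleaning step removes URLs, any URL-based feature extraction would have to run on the raw text before cleaning.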
For each epoch:
1. Train on training batches
2. Validate on validation set
3. Calculate metrics (loss, accuracy)
4. Apply learning rate scheduler
5. Check early stopping condition
6. Save best model checkpoint
- Loss Function: Binary Cross-Entropy with Logits
- Optimizer: Adam (lr=0.001, weight_decay=0.0001)
- Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=3)
- Early Stopping: Patience of 5 epochs
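To see how the two patience values interact, the sketch below replays a list of validation losses through simplified versions of both rules. It is an approximation: PyTorch's `ReduceLROnPlateau` also applies an improvement threshold and other options that are ignored here, and the function name `run_schedule` is hypothetical.

```python
def run_schedule(val_losses, lr=0.001, factor=0.5, lr_patience=3, stop_patience=5):
    """Replay validation losses through plateau-based LR decay and early stopping."""
    best = float("inf")
    best_epoch = None
    since_best = 0   # epochs since the best validation loss (early stopping)
    bad_epochs = 0   # epochs since the last improvement or LR cut (scheduler)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch   # a checkpoint would be saved here
            since_best = bad_epochs = 0
        else:
            since_best += 1
            bad_epochs += 1
        if bad_epochs > lr_patience:         # cut LR once patience is exceeded
            lr *= factor
            bad_epochs = 0
        if since_best >= stop_patience:
            return {"stopped_at": epoch, "best_epoch": best_epoch, "lr": lr}
    return {"stopped_at": None, "best_epoch": best_epoch, "lr": lr}


result = run_schedule([0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46])
```

With these losses the best epoch is 3, the learning rate is halved once at epoch 7, and training stops at epoch 8, so the scheduler gets one chance to recover before early stopping ends the run.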
- Clean, Modern Design: Purple gradient theme with responsive layout
- Email Input: Large textarea for pasting email content
- Threshold Control: Adjustable detection sensitivity
- Results Display:
  - SAFE/UNSAFE classification
  - Model confidence percentage
  - Color-coded results (green for safe, red for unsafe)
- Paste email text (subject, headers, body)
- (Optional) Adjust detection threshold (default: 0.5)
- Click "Analyze Email"
- View classification result and confidence
Based on the Enron fraud dataset:
- Training Accuracy: ~95-97%
- Validation Accuracy: ~92-94%
- Test Accuracy: ~90-93%
- Precision: High (minimizes false positives)
- Recall: Balanced (catches most scams)
- F1-Score: Optimized through threshold tuning
The optimal threshold is computed by:
- Generating predictions on validation set
- Testing thresholds from 0.1 to 0.9
- Selecting threshold with best F1-score
- Saving the selected threshold to optimal_threshold.txt
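The threshold search can be sketched as a simple sweep. The 0.05 step size, the helper names, and the toy validation data below are assumptions made for illustration; the file-writing step is left out.

```python
def f1_at(probs, labels, threshold):
    """F1-score of the positive class when predicting 1 for probs >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels):
    """Return the candidate in 0.10 ... 0.90 with the highest F1-score."""
    candidates = [round(0.1 + 0.05 * i, 2) for i in range(17)]
    # max() keeps the first candidate when several share the best score.
    return max(candidates, key=lambda t: f1_at(probs, labels, t))


# Toy validation predictions and labels, invented for this example.
val_probs = [0.05, 0.20, 0.35, 0.55, 0.62, 0.80, 0.95]
val_labels = [0, 0, 1, 0, 1, 1, 1]
threshold = best_threshold(val_probs, val_labels)
```

Tuning on the validation set rather than the test set keeps the reported test metrics an honest estimate of performance at the chosen threshold.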
- Python 3.8+: Main programming language
- PyTorch 2.0+: Deep learning framework
- Flask: Web framework
- NumPy: Numerical computations
- Pandas: Data manipulation
- Scikit-learn: ML utilities and metrics
- torch.nn: Neural network modules
- torch.optim: Optimization algorithms
- sklearn.model_selection: Data splitting
- sklearn.metrics: Performance evaluation
- sklearn.utils.class_weight: Class balancing
- Model Architecture
  - Transformer-based models (BERT, RoBERTa)
  - Multi-head attention
  - Larger pre-trained embeddings (GloVe, Word2Vec)
- Features
  - Email header analysis (SPF, DKIM, DMARC)
  - Sender reputation scoring
  - Domain age verification
  - Link destination checking
  - Attachment analysis
- Data
  - Expand training dataset
  - Real-time data collection
  - Multi-language support
  - Email thread context
- User Interface
  - Browser extension
  - Email client plugins (Gmail, Outlook)
  - Mobile app
  - API for integration
- Performance
  - Model quantization for faster inference
  - Edge deployment
  - Real-time streaming analysis
  - Batch processing mode
Contributions are welcome! Here's how you can help:
- Report Bugs: Open an issue with details
- Suggest Features: Propose new functionality
- Submit Pull Requests:
  - Fork the repository
  - Create a feature branch
  - Make your changes
  - Submit a PR with description
- Follow PEP 8 style guide
- Add docstrings to functions
- Include comments for complex logic
- Test changes before submitting
This project is created for educational purposes as part of a semester mini-project.
JHASHANK ACHARYA
- Semester 5 Mini Project
- Email Scam Detection System
- Enron Dataset: Used for training and evaluation
- PyTorch Community: Excellent documentation and tutorials
- Flask: Simple and powerful web framework
- Open Source Contributors: Libraries and tools that made this possible
For questions or issues:
- Check existing documentation
- Review code comments
- Open an issue on the repository
- Contact the project maintainer
This system is designed to assist in identifying scam emails but should not be the sole method of protection. Always:
- Verify sender email addresses
- Be cautious with unexpected emails
- Never share sensitive information via email
- Use official websites (not email links) for account access
- Enable two-factor authentication
- Keep software updated
- Initial release
- Bidirectional LSTM with attention
- Feature engineering system
- Flask web interface
- Optimized threshold detection
Last Updated: November 17, 2025