🛡️ Enhanced Email Scam Detection System

An email scam detection system built with PyTorch and Flask. It combines a bidirectional LSTM with an attention mechanism and hand-engineered features (scam keywords, URL patterns, urgency cues, and personal information requests) to identify phishing and scam emails.

🎯 Overview

This system is intended to help protect users from phishing attacks, fraudulent emails, and scam attempts. It analyzes email content using a combination of:

Deep Learning: Bidirectional LSTM with attention mechanism
Feature Engineering: Custom scam keyword detection, URL analysis, urgency detection, and personal information request identification
Threshold Tuning: Classification threshold selected on the validation set for the best F1-score
User-Friendly Interface: Clean web UI showing classification results and confidence scores

✨ Features

Core Capabilities

Advanced Neural Network Architecture
- Bidirectional LSTM for context understanding
- Attention mechanism for focusing on important email parts
- 300-token sequence length for comprehensive analysis
- 15,000-word vocabulary
Multi-Layer Detection
- Deep learning model predictions
- Scam keyword detection with weighted scoring
- Suspicious URL pattern recognition
- Urgency tactic identification (ALL CAPS, excessive punctuation)
- Personal information request detection
Robust Features
- Class weighting for imbalanced datasets
- Stratified train/validation/test splits
- Early stopping and learning rate scheduling
- Dropout and L2 regularization to prevent overfitting
- GPU acceleration support
Web Interface
- Simple, intuitive design
- Real-time email analysis
- Confidence score display
- Adjustable detection threshold

🏗️ Architecture

System Components

┌─────────────────────────────────────────────────────────┐
│                    User Interface (Flask)                │
│                  templates/index_improved.html           │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│              Flask Application (app_improved.py)         │
│                  - Request handling                      │
│                  - Model loading                         │
│                  - Response formatting                   │
└───────────────────────┬─────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────┐
│       Detection Engine (detect_scam_improved.py)         │
│   ┌─────────────────────────────────────────────────┐   │
│   │  Text Preprocessing                             │   │
│   │  - Clean HTML/special chars                     │   │
│   │  - Normalize whitespace                         │   │
│   │  - Tokenization                                 │   │
│   └────────────────┬────────────────────────────────┘   │
│                    ▼                                     │
│   ┌─────────────────────────────────────────────────┐   │
│   │  Feature Extraction                             │   │
│   │  - Scam keywords (weighted)                     │   │
│   │  - Suspicious URLs                              │   │
│   │  - Urgency indicators                           │   │
│   │  - Personal info requests                       │   │
│   └────────────────┬────────────────────────────────┘   │
│                    ▼                                     │
│   ┌─────────────────────────────────────────────────┐   │
│   │  Neural Network Prediction                      │   │
│   │  - Bidirectional LSTM                           │   │
│   │  - Attention mechanism                          │   │
│   │  - Dense layers with dropout                    │   │
│   └────────────────┬────────────────────────────────┘   │
│                    ▼                                     │
│   ┌─────────────────────────────────────────────────┐   │
│   │  Final Classification                           │   │
│   │  - Combine model + features                     │   │
│   │  - Apply threshold                              │   │
│   │  - Calculate confidence                         │   │
│   └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

Neural Network Architecture

Input (300 tokens)
    │
    ▼
Embedding Layer (128 dimensions)
    │
    ▼
Bidirectional LSTM (128 units × 2)
    │
    ▼
Attention Mechanism
    │
    ▼
Dense Layer (64 units) + ReLU + Dropout(0.5)
    │
    ▼
Dense Layer (32 units) + ReLU + Dropout(0.3)
    │
    ▼
Output Layer (1 unit) + Sigmoid
    │
    ▼
Probability (0.0 - 1.0)
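
The diagram above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's actual model class: the class name `AttentionBiLSTM`, the single-linear-layer attention scoring, and the use of index 0 for padding are all assumptions. Padded positions are not masked out of the attention here, which a fuller implementation would probably want to do.

```python
import torch
import torch.nn as nn


class AttentionBiLSTM(nn.Module):
    """Hypothetical sketch of the layer stack shown in the diagram."""

    def __init__(self, vocab_size=15000, embed_dim=128, hidden_dim=128):
        super().__init__()
        # padding_idx=0 is an assumed convention for padded positions.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # Assumed attention form: one learned score per time step.
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, 1),
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)             # (batch, seq, 128)
        outputs, _ = self.lstm(embedded)                 # (batch, seq, 256)
        scores = self.attn(outputs).squeeze(-1)          # (batch, seq)
        weights = torch.softmax(scores, dim=1)           # sums to 1 over seq
        # Weighted sum of LSTM outputs gives one vector per email.
        context = (outputs * weights.unsqueeze(-1)).sum(dim=1)
        logits = self.classifier(context).squeeze(-1)    # (batch,)
        return torch.sigmoid(logits)                     # probability in [0, 1]


model = AttentionBiLSTM()
model.eval()  # disable dropout for inference
batch = torch.randint(1, 15000, (4, 300))  # 4 fake emails, 300 tokens each
with torch.no_grad():
    probs = model(batch)
print(probs.shape)
```

If the model is trained with a logits-based loss, `forward` would typically return `logits` directly and the sigmoid would be applied only at inference time.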

🚀 Installation

Prerequisites

Python 3.8 or higher
pip package manager
(Optional) CUDA-capable GPU for faster training

Step-by-Step Installation

Clone or download the project

cd "c:\Users\JHASHANK\Desktop\JHASHANK\ACHARYA\SEMESTER 5\MINI PROJECT\model_4"

Install dependencies
```
pip install -r requirements.txt
```

Verify installation

python gpu_test.py  # Check if GPU is available
python test_compatibility.py  # Test PyTorch compatibility

Requirements

torch>=2.0.0
pandas
numpy
scikit-learn
flask

💻 Usage

1. Training the Model

Train the model on the Enron fraud dataset:

python train_improved_model_pytorch.py

Training outputs:

email_classification_model_pytorch.pth - Trained model weights
tokenizer_pytorch.pkl - Fitted tokenizer
vocab_pytorch.json - Vocabulary mapping
optimal_threshold.txt - Best classification threshold

Training features:

Stratified train/validation/test split (70%/15%/15%)
Class weighting for imbalanced data
Early stopping (patience: 5 epochs)
Learning rate reduction on plateau
Best model checkpoint saving
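
A stratified 70%/15%/15% split keeps the scam/legitimate ratio the same in every subset. The project lists `sklearn.model_selection` for data splitting; the dependency-free sketch below only illustrates the idea, and the function name `stratified_split` and the seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train=0.70, val=0.15, seed=42):
    """Return (train_idx, val_idx, test_idx) preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx, test_idx = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(round(len(indices) * train))
        n_val = int(round(len(indices) * val))
        train_idx += indices[:n_train]
        val_idx += indices[n_train:n_train + n_val]
        test_idx += indices[n_train + n_val:]  # remainder goes to the test set
    return train_idx, val_idx, test_idx


# Toy labels: 80 legitimate (0) and 20 scam (1) emails.
labels = [0] * 80 + [1] * 20
tr, va, te = stratified_split(labels)
print(len(tr), len(va), len(te))
```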

2. Running the Web Application

Start the Flask web server:

python app_improved.py

Access the application at: http://localhost:5000

3. Command-Line Detection

Test individual emails from the command line:

python detect_scam_improved.py

Interactive mode:

Enter email text when prompted
Receive classification and confidence score
Type 'quit' to exit

📁 Project Structure

model_4/
│
├── app_improved.py                          # Flask web application
├── detect_scam_improved.py                  # Detection engine with feature extraction
├── train_improved_model_pytorch.py          # Model training script
│
├── templates/
│   └── index_improved.html                  # Web interface
│
├── enron_data_fraud_labeled.csv             # Training dataset (Enron emails)
│
├── email_classification_model_pytorch.pth   # Trained model (generated)
├── tokenizer_pytorch.pkl                    # Tokenizer (generated)
├── vocab_pytorch.json                       # Vocabulary (generated)
├── optimal_threshold.txt                    # Optimal threshold (generated)
│
├── requirements.txt                         # Python dependencies
├── gpu_test.py                              # GPU availability test
├── test_compatibility.py                    # PyTorch compatibility test
│
└── README.md                                # This file

🧠 Model Details

Hyperparameters

| Parameter | Value | Description |
|---|---|---|
| Max Sequence Length | 300 | Maximum tokens per email |
| Vocabulary Size | 15,000 | Unique words in vocabulary |
| Embedding Dimension | 128 | Word embedding size |
| LSTM Units | 128 | Hidden units per direction |
| Batch Size | 64 | Training batch size |
| Learning Rate | 0.001 | Initial learning rate |
| Epochs | 20 | Maximum training epochs |
| Dropout Rate | 0.5 / 0.3 | Dropout after dense layer 1 / dense layer 2 |

Model Components

Embedding Layer
- Converts word indices to dense vectors
- Dimension: 128
- Trainable
Bidirectional LSTM
- Processes sequences forward and backward
- Hidden units: 128 per direction (256 total)
- Captures long-term dependencies
Attention Mechanism
- Focuses on important parts of the email
- Computes weighted sum of LSTM outputs
- Improves interpretability
Dense Layers
- Layer 1: 64 units with ReLU + 50% dropout
- Layer 2: 32 units with ReLU + 30% dropout
- Output: 1 unit with sigmoid activation
Regularization
- Dropout layers to prevent overfitting
- L2 weight decay (0.0001)
- Early stopping

🔍 Feature Engineering

1. Scam Keyword Detection

Weighted keyword scoring system:

SCAM_KEYWORDS = {
    'urgent': 3.0,
    'verify': 2.5,
    'suspended': 3.0,
    'credit card': 3.0,
    'social security': 3.5,
    'ssn': 3.5,
    'winner': 3.0,
    'prize': 3.0,
    'irs': 3.0,
    'fbi': 3.5,
    # ... and more
}

Score calculation: the sum of (keyword weight × number of occurrences) over all keywords, divided by the email length
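
A minimal sketch of this scoring rule is shown below. It uses only a subset of the keyword table, and it assumes that "email length" means the number of whitespace-separated words; the source does not say whether words or characters are counted, and the function name `keyword_score` is hypothetical.

```python
import re

# Subset of the weights listed above, for illustration only.
SCAM_KEYWORDS = {
    "urgent": 3.0,
    "verify": 2.5,
    "suspended": 3.0,
    "credit card": 3.0,
    "ssn": 3.5,
    "winner": 3.0,
    "prize": 3.0,
}


def keyword_score(text: str) -> float:
    """Sum of (weight x occurrences) divided by the word count (assumed unit)."""
    lowered = text.lower()
    words = lowered.split()
    if not words:
        return 0.0
    total = 0.0
    for keyword, weight in SCAM_KEYWORDS.items():
        # Word boundaries keep short keywords from matching inside other words.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        total += weight * len(re.findall(pattern, lowered))
    return total / len(words)


sample = "URGENT: verify your account or it will be suspended"
print(round(keyword_score(sample), 3))  # (3.0 + 2.5 + 3.0) / 9 words
```

Because the raw value is not bounded by 1, a real pipeline would likely cap or rescale it before combining it with the other scores.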

2. URL Pattern Analysis

Detects suspicious URLs including:

Non-trusted domains
URL shorteners (bit.ly, tinyurl, etc.)
Typosquatting (paypa1, amaz0n, etc.)
Suspicious TLDs (.xyz, .top, .tk, etc.)
IP addresses in URLs
Multiple dots/dashes (obfuscation)
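
The checks above could be expressed along these lines. Every list in this sketch (trusted domains, shorteners, TLDs, lookalike strings) and the dot/dash limits are illustrative assumptions rather than the project's actual tables, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Illustrative lists only; the project's real tables are not shown in this README.
TRUSTED_DOMAINS = {"paypal.com", "amazon.com", "google.com"}
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}
SUSPICIOUS_TLDS = (".xyz", ".top", ".tk")
LOOKALIKES = ("paypa1", "amaz0n")
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_flags(url: str) -> list:
    """Return the names of the suspicious patterns that a URL matches."""
    host = (urlparse(url).hostname or "").lower()
    flags = []
    if host in SHORTENERS:
        flags.append("shortener")
    if IP_RE.match(host):
        flags.append("ip_address")
    if host.endswith(SUSPICIOUS_TLDS):
        flags.append("suspicious_tld")
    if any(name in host for name in LOOKALIKES):
        flags.append("typosquatting")
    # Assumed limits for "multiple dots/dashes".
    if host.count(".") > 3 or host.count("-") > 2:
        flags.append("obfuscation")
    trusted = any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)
    if host and not trusted and not flags:
        flags.append("untrusted_domain")
    return flags


def find_suspicious_urls(text: str) -> dict:
    """Map each flagged URL found in the text to its list of flags."""
    results = {}
    for raw in URL_RE.findall(text):
        url = raw.rstrip(".,;)")  # drop trailing sentence punctuation
        flags = url_flags(url)
        if flags:
            results[url] = flags
    return results


print(find_suspicious_urls("Login at http://paypa1-secure.xyz/verify."))
```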

3. Urgency Detection

Identifies pressure tactics:

ALL CAPS TEXT (>30% of text)
Excessive punctuation (!!!, ???)
Urgency phrases ("act now", "limited time")
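
As a rough sketch, the three signals can be computed as below. The 30% capital-letter cutoff comes from the list above, but measuring it over alphabetic characters only, treating three or more `!`/`?` in a row as "excessive", and the phrase list itself are assumptions.

```python
import re

# Illustrative phrase list; the first two come from the examples above.
URGENCY_PHRASES = ("act now", "limited time", "immediately")


def urgency_signals(text: str) -> dict:
    """Return one boolean per urgency signal."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    lowered = text.lower()
    return {
        "all_caps": caps_ratio > 0.30,
        "excess_punctuation": bool(re.search(r"[!?]{3,}", text)),
        "urgency_phrase": any(p in lowered for p in URGENCY_PHRASES),
    }


print(urgency_signals("ACT NOW!!! LIMITED TIME offer"))
print(urgency_signals("Please review the attached report."))
```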

4. Personal Information Requests

Flags requests for:

Passwords
Credit card numbers
Social Security Numbers (SSN)
Bank account details
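
One simple way to flag such requests is to require both a request verb and a sensitive-data term in the same email. The patterns and verb list below are hypothetical examples, not the project's actual rules.

```python
import re

# Illustrative patterns for the four categories listed above.
PERSONAL_INFO_PATTERNS = {
    "password": r"\bpass(word|code)\b",
    "credit_card": r"\b(credit|debit)\s+card\b",
    "ssn": r"\b(ssn|social\s+security)\b",
    "bank_account": r"\b(bank\s+account|routing\s+number)\b",
}
# Assumed set of verbs that indicate the sender is asking for something.
REQUEST_VERBS = r"\b(send|provide|confirm|enter|update|verify|reply\s+with)\b"


def personal_info_requests(text: str) -> list:
    """Return the categories of personal data the email appears to request."""
    lowered = text.lower()
    if not re.search(REQUEST_VERBS, lowered):
        return []
    return [name for name, pattern in PERSONAL_INFO_PATTERNS.items()
            if re.search(pattern, lowered)]


print(personal_info_requests("Please confirm your password and credit card number"))
```

Requiring a verb reduces false positives on emails that merely mention a bank account or password without asking for it.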

Final Score Calculation

final_score = (
    0.50 × model_prediction +
    0.20 × keyword_score +
    0.15 × url_score +
    0.10 × urgency_score +
    0.05 × personal_info_score
)
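
In runnable form, the weighted combination looks like the sketch below. The weights are the ones given above; clamping each component to [0, 1], the dictionary keys, and the `>=` comparison against the threshold are assumptions.

```python
WEIGHTS = {
    "model": 0.50,
    "keyword": 0.20,
    "url": 0.15,
    "urgency": 0.10,
    "personal_info": 0.05,
}


def final_score(scores: dict) -> float:
    """Weighted sum of component scores, each clamped to [0, 1] first."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(scores.get(name, 0.0), 0.0), 1.0)
        total += weight * value
    return total


def classify(scores: dict, threshold: float = 0.5) -> str:
    """Label the email UNSAFE when the combined score reaches the threshold."""
    return "UNSAFE" if final_score(scores) >= threshold else "SAFE"


example = {"model": 0.9, "keyword": 0.5, "url": 1.0, "urgency": 0.0, "personal_info": 1.0}
print(round(final_score(example), 2), classify(example))
```

Since the weights sum to 1.0, the combined score stays in [0, 1] and can be compared directly with the detection threshold.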

🎓 Training Process

Data Preparation

Load Enron Dataset
- CSV format with email text and fraud labels
- Combines Subject + Body for full context
Text Preprocessing
- Remove HTML tags, URLs, email addresses
- Normalize whitespace and special characters
- Convert to lowercase
Tokenization
- Build vocabulary from training data
- Convert text to sequences of integers
- Pad/truncate to fixed length (300)
Class Balancing
- Compute class weights for imbalanced data
- Apply weights during training
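
The preprocessing, tokenization, and class-weighting steps can be sketched without external libraries as follows. The regular expressions, the reserved indices for padding and unknown words, and all function names are assumptions; the class-weight formula is the common "balanced" heuristic, n_samples / (n_classes × class_count).

```python
import re
from collections import Counter

MAX_LEN = 300
VOCAB_SIZE = 15000
PAD, UNK = 0, 1  # reserved indices (assumed convention)


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, email addresses, and special characters."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def build_vocab(texts, max_size=VOCAB_SIZE):
    """Map the most frequent training words to integer ids starting at 2."""
    counts = Counter(word for t in texts for word in t.split())
    most_common = counts.most_common(max_size - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode(text, vocab, max_len=MAX_LEN):
    """Convert text to ids, then truncate or pad to a fixed length."""
    ids = [vocab.get(w, UNK) for w in text.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def class_weights(labels):
    """Balanced weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


cleaned = clean_text("<p>Hello WORLD!</p> visit http://x.com or mail a@b.com")
vocab = build_vocab([cleaned])
print(cleaned)
print(encode(cleaned, vocab, max_len=8))
print(class_weights([0, 0, 0, 1]))
```

Note that URL-based features would need to be extracted from the raw text, before this cleaning step removes the URLs.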

Training Loop

For each epoch:
    1. Train on training batches
    2. Validate on validation set
    3. Calculate metrics (loss, accuracy)
    4. Apply learning rate scheduler
    5. Check early stopping condition
    6. Save best model checkpoint
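
Steps 5 and 6 of the loop can be illustrated with a small helper. This sketch assumes that validation loss is the monitored quantity, which the README does not state explicitly; the class name `EarlyStopping` is hypothetical, and simulated losses stand in for real training.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> tuple:
        """Return (is_best, should_stop) for the latest validation loss."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience


# Simulated validation losses stand in for steps 1-3 of the loop.
val_losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46]
stopper = EarlyStopping(patience=5)
best_epoch, stopped_at = None, None
for epoch, loss in enumerate(val_losses, start=1):
    is_best, should_stop = stopper.step(loss)
    if is_best:
        best_epoch = epoch  # a real loop would save the checkpoint here
    if should_stop:
        stopped_at = epoch
        break
print(best_epoch, stopped_at)
```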

Optimization

Loss Function: Binary Cross-Entropy with Logits (this loss applies the sigmoid internally, so it expects raw scores rather than probabilities)
Optimizer: Adam (lr=0.001, weight_decay=0.0001)
Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=3)
Early Stopping: Patience of 5 epochs
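
The settings above map onto PyTorch roughly as shown below. A tiny linear layer stands in for the real network, the data is random, and the `pos_weight` value is a placeholder for whatever the computed class weight turns out to be; all of these are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Stand-in model; the real network is the BiLSTM described in this README.
model = nn.Linear(10, 1)

# pos_weight=3.0 is a placeholder for the computed scam-class weight.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

# One training step on a random batch of 64 examples.
features = torch.randn(64, 10)
labels = torch.randint(0, 2, (64,)).float()

optimizer.zero_grad()
logits = model(features).squeeze(-1)  # raw scores, no sigmoid applied
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

# The scheduler is normally stepped once per epoch with the validation loss.
scheduler.step(loss.item())
print(optimizer.param_groups[0]["lr"])
```

The learning rate is halved only after the monitored loss fails to improve for more than three consecutive epochs, so it is unchanged after this single step.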

🌐 Web Interface

Features

Clean, Modern Design: Purple gradient theme with responsive layout
Email Input: Large textarea for pasting email content
Threshold Control: Adjustable detection sensitivity
Results Display:
- SAFE/UNSAFE classification
- Model confidence percentage
- Color-coded results (green for safe, red for unsafe)

Usage Flow

Paste email text (subject, headers, body)
(Optional) Adjust detection threshold (default: 0.5)
Click "Analyze Email"
View classification result and confidence

📊 Performance

Expected Metrics

Based on the Enron fraud dataset:

Training Accuracy: ~95-97%
Validation Accuracy: ~92-94%
Test Accuracy: ~90-93%
Precision: High (minimizes false positives)
Recall: Balanced (catches most scams)
F1-Score: Optimized through threshold tuning

Threshold Optimization

The optimal threshold is computed by:

Generating predictions on validation set
Testing thresholds from 0.1 to 0.9
Selecting threshold with best F1-score
Saving to optimal_threshold.txt
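
The sweep can be sketched in plain Python as follows. The 0.05 step size and the toy validation data are assumptions, and the function names are hypothetical; writing the result to `optimal_threshold.txt` is left out.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (scam) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, step=0.05):
    """Sweep thresholds from 0.1 to 0.9 and keep the first one with the best F1."""
    best_t, best_f1 = 0.5, -1.0
    n_steps = int(round((0.9 - 0.1) / step))
    for i in range(n_steps + 1):
        t = round(0.1 + i * step, 2)
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation labels and predicted probabilities.
y_val = [0, 0, 0, 1, 1, 1]
p_val = [0.10, 0.30, 0.55, 0.60, 0.80, 0.90]
threshold, f1 = best_threshold(y_val, p_val)
print(threshold, f1)  # 0.6 1.0
```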

🛠️ Technologies Used

Core Technologies

Python 3.8+: Main programming language
PyTorch 2.0+: Deep learning framework
Flask: Web framework
NumPy: Numerical computations
Pandas: Data manipulation
Scikit-learn: ML utilities and metrics

Key Libraries

torch.nn: Neural network modules
torch.optim: Optimization algorithms
sklearn.model_selection: Data splitting
sklearn.metrics: Performance evaluation
sklearn.utils.class_weight: Class balancing

🔮 Future Improvements

Planned Enhancements

Model Architecture
- Transformer-based models (BERT, RoBERTa)
- Multi-head attention
- Larger pre-trained embeddings (GloVe, Word2Vec)
Features
- Email header analysis (SPF, DKIM, DMARC)
- Sender reputation scoring
- Domain age verification
- Link destination checking
- Attachment analysis
Data
- Expand training dataset
- Real-time data collection
- Multi-language support
- Email thread context
User Interface
- Browser extension
- Email client plugins (Gmail, Outlook)
- Mobile app
- API for integration
Performance
- Model quantization for faster inference
- Edge deployment
- Real-time streaming analysis
- Batch processing mode

🤝 Contributing

Contributions are welcome! Here's how you can help:

Report Bugs: Open an issue with details
Suggest Features: Propose new functionality
Submit Pull Requests:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a PR with description

Development Guidelines

Follow PEP 8 style guide
Add docstrings to functions
Include comments for complex logic
Test changes before submitting

📄 License

This project is created for educational purposes as part of a semester mini-project.

👥 Authors

JHASHANK ACHARYA

Semester 5 Mini Project
Email Scam Detection System

🙏 Acknowledgments

Enron Dataset: Used for training and evaluation
PyTorch Community: Excellent documentation and tutorials
Flask: Simple and powerful web framework
Open Source Contributors: Libraries and tools that made this possible

📞 Support

For questions or issues:

Check existing documentation
Review code comments
Open an issue on the repository
Contact the project maintainer

🔐 Security Note

This system is designed to assist in identifying scam emails but should not be the sole method of protection. Always:

Verify sender email addresses
Be cautious with unexpected emails
Never share sensitive information via email
Use official websites (not email links) for account access
Enable two-factor authentication
Keep software updated

📈 Version History

Version 1.0 (Current)

Initial release
Bidirectional LSTM with attention
Feature engineering system
Flask web interface
Optimized threshold detection

Last Updated: November 17, 2025
