
Email scam detector combining linguistically informed feature engineering with a PyTorch classifier trained on an Enron-derived labeled corpus. Extracts domain, metadata, stylistic and entropy features to produce interpretable signals and a calibrated confidence score. Users adjust thresholds; outputs: verdict, confidence %, and concise summary.


NYN-05/MiniProject


🛡️ Enhanced Email Scam Detection System

An email scam detection system built with PyTorch and Flask. It combines a bidirectional LSTM with an attention mechanism and hand-engineered text features to identify phishing and scam emails.


🎯 Overview

This system flags phishing attacks, fraudulent emails, and scam attempts by analyzing email content with a combination of:

  • Deep Learning: Bidirectional LSTM with attention mechanism
  • Feature Engineering: Custom scam keyword detection, URL analysis, urgency detection, and personal information request identification
  • Smart Thresholding: Optimized classification threshold for balanced accuracy
  • User-Friendly Interface: Clean web UI showing classification results and confidence scores

✨ Features

Core Capabilities

  1. Advanced Neural Network Architecture

    • Bidirectional LSTM for context understanding
    • Attention mechanism for focusing on important email parts
    • 300-token sequence length for comprehensive analysis
    • 15,000-word vocabulary
  2. Multi-Layer Detection

    • Deep learning model predictions
    • Scam keyword detection with weighted scoring
    • Suspicious URL pattern recognition
    • Urgency tactic identification (ALL CAPS, excessive punctuation)
    • Personal information request detection
  3. Robust Features

    • Class weighting for imbalanced datasets
    • Stratified train/validation/test splits
    • Early stopping and learning rate scheduling
    • Dropout and L2 regularization to prevent overfitting
    • GPU acceleration support
  4. Web Interface

    • Simple, intuitive design
    • Real-time email analysis
    • Confidence score display
    • Adjustable detection threshold

🏗️ Architecture

System Components

┌──────────────────────────────────────────────────────────┐
│                    User Interface (Flask)                │
│                  templates/index_improved.html           │
└───────────────────────┬──────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────┐
│              Flask Application (app_improved.py)         │
│                  - Request handling                      │
│                  - Model loading                         │
│                  - Response formatting                   │
└───────────────────────┬──────────────────────────────────┘
                        │
                        ▼
┌──────────────────────────────────────────────────────────┐
│       Detection Engine (detect_scam_improved.py)         │
│   ┌──────────────────────────────────────────────────┐   │
│   │  Text Preprocessing                              │   │
│   │  - Clean HTML/special chars                      │   │
│   │  - Normalize whitespace                          │   │
│   │  - Tokenization                                  │   │
│   └────────────────┬─────────────────────────────────┘   │
│                    ▼                                     │
│   ┌──────────────────────────────────────────────────┐   │
│   │  Feature Extraction                              │   │
│   │  - Scam keywords (weighted)                      │   │
│   │  - Suspicious URLs                               │   │
│   │  - Urgency indicators                            │   │
│   │  - Personal info requests                        │   │
│   └────────────────┬─────────────────────────────────┘   │
│                    ▼                                     │
│   ┌──────────────────────────────────────────────────┐   │
│   │  Neural Network Prediction                       │   │
│   │  - Bidirectional LSTM                            │   │
│   │  - Attention mechanism                           │   │
│   │  - Dense layers with dropout                     │   │
│   └────────────────┬─────────────────────────────────┘   │
│                    ▼                                     │
│   ┌──────────────────────────────────────────────────┐   │
│   │  Final Classification                            │   │
│   │  - Combine model + features                      │   │
│   │  - Apply threshold                               │   │
│   │  - Calculate confidence                          │   │
│   └──────────────────────────────────────────────────┘   │
└──────────────────────────────────────────────────────────┘

Neural Network Architecture

Input (300 tokens)
    │
    ▼
Embedding Layer (128 dimensions)
    │
    ▼
Bidirectional LSTM (128 units × 2)
    │
    ▼
Attention Mechanism
    │
    ▼
Dense Layer (64 units) + ReLU + Dropout(0.5)
    │
    ▼
Dense Layer (32 units) + ReLU + Dropout(0.3)
    │
    ▼
Output Layer (1 unit) + Sigmoid
    │
    ▼
Probability (0.0 - 1.0)
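
The layer stack above could be sketched in PyTorch roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the class name `AttentionBiLSTM`, the use of index 0 for padding, and the single-linear-layer attention scoring are all hypothetical choices. The sketch returns raw logits and applies the sigmoid only at inference, and the attention here does not mask padded positions.

```python
import torch
import torch.nn as nn


class AttentionBiLSTM(nn.Module):
    """Illustrative BiLSTM + attention classifier (hypothetical name)."""

    def __init__(self, vocab_size=15000, embed_dim=128, hidden_dim=128):
        super().__init__()
        # Assumption: token id 0 is reserved for padding.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Assumption: one learned score per time step over the 256-dim output.
        self.attn = nn.Linear(hidden_dim * 2, 1)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, 64), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(32, 1),
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)                # (batch, seq, 128)
        outputs, _ = self.lstm(embedded)                    # (batch, seq, 256)
        weights = torch.softmax(self.attn(outputs), dim=1)  # (batch, seq, 1)
        context = (weights * outputs).sum(dim=1)            # (batch, 256)
        return self.classifier(context).squeeze(-1)         # logits, (batch,)


model = AttentionBiLSTM()
model.eval()  # disable dropout for inference
batch = torch.randint(1, 15000, (2, 300))  # two fake emails, 300 token ids each
with torch.no_grad():
    probs = torch.sigmoid(model(batch))    # probabilities in [0, 1]
print(probs.shape)  # torch.Size([2])
```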

🚀 Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager
  • (Optional) CUDA-capable GPU for faster training

Step-by-Step Installation

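  1. Clone or download the project, then change into the project directory (the path below is the author's local Windows path; substitute your own)

    cd "c:\Users\JHASHANK\Desktop\JHASHANK\ACHARYA\SEMESTER 5\MINI PROJECT\model_4"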
  2. Install dependencies

    pip install -r requirements.txt
  3. Verify installation

    python gpu_test.py  # Check if GPU is available
    python test_compatibility.py  # Test PyTorch compatibility

Requirements

torch>=2.0.0
pandas
numpy
scikit-learn
flask

💻 Usage

1. Training the Model

Train the model on the Enron fraud dataset:

python train_improved_model_pytorch.py

Training outputs:

  • email_classification_model_pytorch.pth - Trained model weights
  • tokenizer_pytorch.pkl - Fitted tokenizer
  • vocab_pytorch.json - Vocabulary mapping
  • optimal_threshold.txt - Best classification threshold

Training features:

  • Stratified train/validation/test split (70%/15%/15%)
  • Class weighting for imbalanced data
  • Early stopping (patience: 5 epochs)
  • Learning rate reduction on plateau
  • Best model checkpoint saving
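
The stratified 70%/15%/15% split and the class weighting listed above could look roughly like this. It is a standard-library sketch with hypothetical function names; scikit-learn's `train_test_split(..., stratify=...)` and `compute_class_weight` are the more usual tools, and the "balanced" formula below (`n_samples / (n_classes * class_count)`) is an assumption about how the weights are computed.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


labels = [0] * 80 + [1] * 20  # toy imbalanced labels: 80 legitimate, 20 scam
train_idx, val_idx, test_idx = stratified_split(labels)
weights = balanced_class_weights(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
print(weights)  # {0: 0.625, 1: 2.5}
```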

2. Running the Web Application

Start the Flask web server:

python app_improved.py

Access the application at: http://localhost:5000

3. Command-Line Detection

Test individual emails from the command line:

python detect_scam_improved.py

Interactive mode:

  • Enter email text when prompted
  • Receive classification and confidence score
  • Type 'quit' to exit

📁 Project Structure

model_4/
│
├── app_improved.py                          # Flask web application
├── detect_scam_improved.py                  # Detection engine with feature extraction
├── train_improved_model_pytorch.py          # Model training script
│
├── templates/
│   └── index_improved.html                  # Web interface
│
├── enron_data_fraud_labeled.csv             # Training dataset (Enron emails)
│
├── email_classification_model_pytorch.pth   # Trained model (generated)
├── tokenizer_pytorch.pkl                    # Tokenizer (generated)
├── vocab_pytorch.json                       # Vocabulary (generated)
├── optimal_threshold.txt                    # Optimal threshold (generated)
│
├── requirements.txt                         # Python dependencies
├── gpu_test.py                              # GPU availability test
├── test_compatibility.py                    # PyTorch compatibility test
│
└── README.md                                # This file

🧠 Model Details

Hyperparameters

| Parameter           | Value     | Description                |
|---------------------|-----------|----------------------------|
| Max Sequence Length | 300       | Maximum tokens per email   |
| Vocabulary Size     | 15,000    | Unique words in vocabulary |
| Embedding Dimension | 128       | Word embedding size        |
| LSTM Units          | 128       | Hidden units per direction |
| Batch Size          | 64        | Training batch size        |
| Learning Rate       | 0.001     | Initial learning rate      |
| Epochs              | 20        | Maximum training epochs    |
| Dropout Rate        | 0.5 / 0.3 | Regularization dropout     |

Model Components

  1. Embedding Layer

    • Converts word indices to dense vectors
    • Dimension: 128
    • Trainable
  2. Bidirectional LSTM

    • Processes sequences forward and backward
    • Hidden units: 128 per direction (256 total)
    • Captures long-term dependencies
  3. Attention Mechanism

    • Focuses on important parts of the email
    • Computes weighted sum of LSTM outputs
    • Improves interpretability
  4. Dense Layers

    • Layer 1: 64 units with ReLU + 50% dropout
    • Layer 2: 32 units with ReLU + 30% dropout
    • Output: 1 unit with sigmoid activation
  5. Regularization

    • Dropout layers to prevent overfitting
    • L2 weight decay (0.0001)
    • Early stopping

🔍 Feature Engineering

1. Scam Keyword Detection

Weighted keyword scoring system:

SCAM_KEYWORDS = {
    'urgent': 3.0,
    'verify': 2.5,
    'suspended': 3.0,
    'credit card': 3.0,
    'social security': 3.5,
    'ssn': 3.5,
    'winner': 3.0,
    'prize': 3.0,
    'irs': 3.0,
    'fbi': 3.5,
    # ... and more
}

Score calculation: Sum of (keyword weight × occurrences) / email length
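
A minimal sketch of that score calculation, assuming "email length" means the number of whitespace-separated words and that matching is case-insensitive on whole words. The function name `keyword_score` is hypothetical and the dictionary is only a subset of the one above.

```python
import re

# Subset of the weighted keyword table, for illustration only.
SCAM_KEYWORDS = {
    "urgent": 3.0,
    "verify": 2.5,
    "suspended": 3.0,
    "credit card": 3.0,
    "ssn": 3.5,
}


def keyword_score(text, keywords=SCAM_KEYWORDS):
    """Sum of (weight * occurrences), divided by the word count."""
    lowered = text.lower()
    n_words = len(lowered.split())
    if n_words == 0:
        return 0.0
    total = 0.0
    for phrase, weight in keywords.items():
        # Word boundaries keep "ssn" from matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += weight * len(re.findall(pattern, lowered))
    return total / n_words


score = keyword_score("URGENT: verify your credit card now")
print(round(score, 3))  # 1.417  (3.0 + 2.5 + 3.0 over 6 words)
```

Note that a score computed this way is not bounded by 1, so it would need clipping or rescaling before being mixed with probabilities.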

2. URL Pattern Analysis

Detects suspicious URLs including:

  • Non-trusted domains
  • URL shorteners (bit.ly, tinyurl, etc.)
  • Typosquatting (paypa1, amaz0n, etc.)
  • Suspicious TLDs (.xyz, .top, .tk, etc.)
  • IP addresses in URLs
  • Multiple dots/dashes (obfuscation)
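
Several of these checks can be sketched with the standard library alone. Everything below is an assumption made for illustration: the flag names, the example shortener, TLD and typosquat lists, and the dot/dash limits. The "non-trusted domains" check is omitted because the source does not say which domains count as trusted.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
SHORTENERS = {"bit.ly", "tinyurl.com"}        # example list, not exhaustive
SUSPICIOUS_TLDS = (".xyz", ".top", ".tk")
TYPOSQUATS = ("paypa1", "amaz0n")
IP_HOST_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_flags(url):
    """Return the list of suspicious traits found in one URL."""
    host = (urlparse(url).hostname or "").lower()
    flags = []
    if IP_HOST_RE.match(host):
        flags.append("ip_address")
    if host in SHORTENERS:
        flags.append("shortener")
    if host.endswith(SUSPICIOUS_TLDS):
        flags.append("suspicious_tld")
    if any(name in host for name in TYPOSQUATS):
        flags.append("typosquat")
    # Assumed limits: more than 3 dots or more than 2 dashes in the host.
    if host.count(".") > 3 or host.count("-") > 2:
        flags.append("obfuscation")
    return flags


def suspicious_urls(text):
    """Map each flagged URL in the text to its list of flags."""
    found = {}
    for url in URL_RE.findall(text):
        flags = url_flags(url)
        if flags:
            found[url] = flags
    return found


email = "Click http://paypa1-secure.xyz/verify or visit https://www.example.com"
print(suspicious_urls(email))
# {'http://paypa1-secure.xyz/verify': ['suspicious_tld', 'typosquat']}
```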

3. Urgency Detection

Identifies pressure tactics:

  • ALL CAPS TEXT (>30% of text)
  • Excessive punctuation (!!!, ???)
  • Urgency phrases ("act now", "limited time")
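
The three urgency signals could be computed along these lines. This is a sketch with a hypothetical function name; it assumes the 30% figure refers to the share of uppercase letters among all letters, and that "excessive punctuation" means three or more consecutive `!` or `?` characters.

```python
import re

URGENCY_PHRASES = ("act now", "limited time")  # example phrases from the list


def urgency_signals(text):
    """Return one boolean per urgency indicator."""
    letters = [c for c in text if c.isalpha()]
    caps_ratio = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    return {
        "mostly_caps": caps_ratio > 0.30,
        "excess_punct": bool(re.search(r"[!?]{3,}", text)),
        "urgency_phrase": any(p in text.lower() for p in URGENCY_PHRASES),
    }


print(urgency_signals("ACT NOW!!! LIMITED TIME offer"))
# {'mostly_caps': True, 'excess_punct': True, 'urgency_phrase': True}
```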

4. Personal Information Requests

Flags requests for:

  • Passwords
  • Credit card numbers
  • Social Security Numbers (SSN)
  • Bank account details
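
One simple way to flag these categories is a small table of regular expressions, as sketched below. The pattern names and the patterns themselves are assumptions; note that this detects mentions of sensitive data, which is only a rough proxy for an actual request.

```python
import re

# Hypothetical patterns, one per category listed above.
PERSONAL_INFO_PATTERNS = {
    "password": r"\bpasswords?\b",
    "credit_card": r"\bcredit\s+card\b",
    "ssn": r"\b(ssn|social\s+security)\b",
    "bank_account": r"\bbank\s+account\b",
}


def personal_info_requests(text):
    """Return the categories of sensitive data mentioned in the text."""
    lowered = text.lower()
    return [
        name
        for name, pattern in PERSONAL_INFO_PATTERNS.items()
        if re.search(pattern, lowered)
    ]


print(personal_info_requests(
    "Please confirm your password and bank account number"
))  # ['password', 'bank_account']
```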

Final Score Calculation

final_score = (
    0.50 × model_prediction +
    0.20 × keyword_score +
    0.15 × url_score +
    0.10 × urgency_score +
    0.05 × personal_info_score
)
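
As a worked example of this weighted sum, the sketch below assumes every component score has already been scaled to the range 0 to 1 (values outside it are clipped) and that an email is labeled UNSAFE when the final score reaches the threshold. The names `final_score`, `classify`, and the dictionary keys are hypothetical.

```python
WEIGHTS = {
    "model": 0.50,
    "keyword": 0.20,
    "url": 0.15,
    "urgency": 0.10,
    "personal_info": 0.05,
}


def clamp01(value):
    """Clip a score into [0, 1] (assumed normalization)."""
    return max(0.0, min(1.0, value))


def final_score(scores, weights=WEIGHTS):
    """Weighted sum of component scores; missing components count as 0."""
    return sum(w * clamp01(scores.get(name, 0.0)) for name, w in weights.items())


def classify(scores, threshold=0.5):
    """Return (verdict, score); the >= comparison is an assumption."""
    score = final_score(scores)
    return ("UNSAFE" if score >= threshold else "SAFE"), score


scores = {"model": 0.9, "keyword": 0.5, "url": 1.0,
          "urgency": 0.0, "personal_info": 1.0}
verdict, score = classify(scores)
print(verdict, round(score, 2))  # UNSAFE 0.75
```

Because the weights sum to 1.0, the final score stays in the same 0 to 1 range as its inputs, which is what lets a single threshold apply to it.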

🎓 Training Process

Data Preparation

  1. Load Enron Dataset

    • CSV format with email text and fraud labels
    • Combines Subject + Body for full context
  2. Text Preprocessing

    • Remove HTML tags, URLs, email addresses
    • Normalize whitespace and special characters
    • Convert to lowercase
  3. Tokenization

    • Build vocabulary from training data
    • Convert text to sequences of integers
    • Pad/truncate to fixed length (300)
  4. Class Balancing

    • Compute class weights for imbalanced data
    • Apply weights during training
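
The preprocessing and tokenization steps above might be implemented along these lines. This is a sketch, not the project's tokenizer: the function names, the regular expressions, and the reserved indices (0 for padding, 1 for unknown words) are assumptions.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices


def clean_text(text):
    """Strip HTML tags, URLs, email addresses and special characters."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)         # special characters
    return re.sub(r"\s+", " ", text).strip().lower()


def build_vocab(texts, max_size=15000):
    """Map the most frequent training words to integer ids starting at 2."""
    counts = Counter(word for t in texts for word in clean_text(t).split())
    most_common = counts.most_common(max_size - 2)
    return {word: i + 2 for i, (word, _) in enumerate(most_common)}


def encode(text, vocab, max_len=300):
    """Convert text to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(w, UNK) for w in clean_text(text).split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["win a prize", "win now"])
print(clean_text("<b>Hello</b> visit http://x.com or mail a@b.com NOW!!"))
# hello visit or mail now
print(encode("win big", vocab, max_len=4))  # [2, 1, 0, 0]
```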

Training Loop

For each epoch:
    1. Train on training batches
    2. Validate on validation set
    3. Calculate metrics (loss, accuracy)
    4. Apply learning rate scheduler
    5. Check early stopping condition
    6. Save best model checkpoint

Optimization

  • Loss Function: Binary Cross-Entropy with Logits (this loss applies the sigmoid internally, so it expects raw logits rather than sigmoid outputs)
  • Optimizer: Adam (lr=0.001, weight_decay=0.0001)
  • Learning Rate Schedule: ReduceLROnPlateau (factor=0.5, patience=3)
  • Early Stopping: Patience of 5 epochs
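
The optimizer, scheduler, and early-stopping settings above fit together roughly as in the sketch below. To keep it self-contained, it trains a toy `nn.Linear` model on random tensors with one full-batch step per epoch; those stand-ins are assumptions, and the real script would iterate over mini-batches of tokenized emails instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): random features instead of tokenized emails.
X_train, y_train = torch.randn(64, 10), torch.randint(0, 2, (64,)).float()
X_val, y_val = torch.randn(32, 10), torch.randint(0, 2, (32,)).float()
model = nn.Linear(10, 1)

criterion = nn.BCEWithLogitsLoss()  # expects raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

best_val, bad_epochs, patience, best_state = float("inf"), 0, 5, None
for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train).squeeze(-1), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val).squeeze(-1), y_val).item()
    scheduler.step(val_loss)  # halve the LR after 3 epochs without improvement

    if val_loss < best_val:
        # Keep an in-memory copy of the best weights (the checkpoint).
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping

model.load_state_dict(best_state)  # restore the best checkpoint
```

Class weights for imbalanced data could be passed to `nn.BCEWithLogitsLoss` through its `pos_weight` argument; whether the project does it that way is not stated in this README.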

🌐 Web Interface

Features

  • Clean, Modern Design: Purple gradient theme with responsive layout
  • Email Input: Large textarea for pasting email content
  • Threshold Control: Adjustable detection sensitivity
  • Results Display:
    • SAFE/UNSAFE classification
    • Model confidence percentage
    • Color-coded results (green for safe, red for unsafe)

Usage Flow

  1. Paste email text (subject, headers, body)
  2. (Optional) Adjust detection threshold (default: 0.5)
  3. Click "Analyze Email"
  4. View classification result and confidence

📊 Performance

Expected Metrics

Based on the Enron fraud dataset:

  • Training Accuracy: ~95-97%
  • Validation Accuracy: ~92-94%
  • Test Accuracy: ~90-93%
  • Precision: High (minimizes false positives)
  • Recall: Balanced (catches most scams)
  • F1-Score: Optimized through threshold tuning

Threshold Optimization

The optimal threshold is computed by:

  1. Generating predictions on validation set
  2. Testing thresholds from 0.1 to 0.9
  3. Selecting threshold with best F1-score
  4. Saving to optimal_threshold.txt
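
The threshold search can be sketched in plain Python as below. The step size of 0.05 between 0.1 and 0.9 is an assumption (the README gives only the range), the function names are hypothetical, and the validation probabilities are made-up toy values. Writing the result to optimal_threshold.txt is left out.

```python
def f1_at(probs, labels, threshold):
    """F1-score of the positive class when predicting prob >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels, step=0.05):
    """Try thresholds from 0.1 to 0.9 and return the one with the best F1."""
    n_steps = int(round(0.8 / step))
    candidates = [round(0.1 + i * step, 2) for i in range(n_steps + 1)]
    return max(candidates, key=lambda t: f1_at(probs, labels, t))


# Toy validation predictions (illustrative values only).
val_probs = [0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
val_labels = [0, 0, 1, 1, 1, 1]
threshold = best_threshold(val_probs, val_labels)
print(threshold)  # 0.4
```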

🛠️ Technologies Used

Core Technologies

  • Python 3.8+: Main programming language
  • PyTorch 2.0+: Deep learning framework
  • Flask: Web framework
  • NumPy: Numerical computations
  • Pandas: Data manipulation
  • Scikit-learn: ML utilities and metrics

Key Libraries

  • torch.nn: Neural network modules
  • torch.optim: Optimization algorithms
  • sklearn.model_selection: Data splitting
  • sklearn.metrics: Performance evaluation
  • sklearn.utils.class_weight: Class balancing

🔮 Future Improvements

Planned Enhancements

  1. Model Architecture

    • Transformer-based models (BERT, RoBERTa)
    • Multi-head attention
    • Larger pre-trained embeddings (GloVe, Word2Vec)
  2. Features

    • Email header analysis (SPF, DKIM, DMARC)
    • Sender reputation scoring
    • Domain age verification
    • Link destination checking
    • Attachment analysis
  3. Data

    • Expand training dataset
    • Real-time data collection
    • Multi-language support
    • Email thread context
  4. User Interface

    • Browser extension
    • Email client plugins (Gmail, Outlook)
    • Mobile app
    • API for integration
  5. Performance

    • Model quantization for faster inference
    • Edge deployment
    • Real-time streaming analysis
    • Batch processing mode

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Report Bugs: Open an issue with details
  2. Suggest Features: Propose new functionality
  3. Submit Pull Requests:
    • Fork the repository
    • Create a feature branch
    • Make your changes
    • Submit a PR with description

Development Guidelines

  • Follow PEP 8 style guide
  • Add docstrings to functions
  • Include comments for complex logic
  • Test changes before submitting

📄 License

This project is created for educational purposes as part of a semester mini-project.


👥 Authors

JHASHANK ACHARYA

  • Semester 5 Mini Project
  • Email Scam Detection System

🙏 Acknowledgments

  • Enron Dataset: Used for training and evaluation
  • PyTorch Community: Excellent documentation and tutorials
  • Flask: Simple and powerful web framework
  • Open Source Contributors: Libraries and tools that made this possible

📞 Support

For questions or issues:

  1. Check existing documentation
  2. Review code comments
  3. Open an issue on the repository
  4. Contact the project maintainer

🔐 Security Note

This system is designed to assist in identifying scam emails but should not be the sole method of protection. Always:

  • Verify sender email addresses
  • Be cautious with unexpected emails
  • Never share sensitive information via email
  • Use official websites (not email links) for account access
  • Enable two-factor authentication
  • Keep software updated

📈 Version History

Version 1.0 (Current)

  • Initial release
  • Bidirectional LSTM with attention
  • Feature engineering system
  • Flask web interface
  • Optimized threshold detection

Last Updated: November 17, 2025
