AI-powered audio authenticity verification using CNN-based spectrogram analysis.
This system uses a Convolutional Neural Network (CNN) to detect deepfake audio by analyzing Log-Mel spectrograms. The model is trained on 2-second audio clips and can distinguish between real and AI-generated audio with high accuracy.
- Input: Log-Mel spectrogram (1x128x200)
- Architecture: 4 convolutional blocks with increasing filters (32, 64, 128, 256)
- Output: Fakeness score (0-1)
- Parameters: ~1.5M trainable parameters
- Model Size: ~6MB
- Training: 13,956 samples (6,978 fake + 6,978 real)
- Testing: 1,088 samples (544 fake + 544 real)
- Format: 2-second WAV files, preprocessed at 16kHz
- Expected Accuracy: 90-95%
- Expected F1 Score: 0.88-0.93
- Inference Time: <100ms per clip
- Training Time: 2-3 hours on dual RTX 5060 Ti
```
FAC/
├── backend/
│   ├── model.py            # CNN architecture
│   ├── preprocessing.py    # Spectrogram generation
│   ├── augmentation.py     # Data augmentation (5 techniques)
│   ├── dataset.py          # PyTorch Dataset & DataLoader
│   ├── training.py         # Training loop with validation
│   ├── server.py           # Flask API
│   ├── main.py             # Main orchestrator
│   ├── requirements.txt    # Python dependencies
│   ├── checkpoints/        # Saved models
│   └── results/            # Training graphs & metrics
├── frontend/
│   ├── index.html          # UI structure
│   ├── style.css           # Styling
│   ├── main.js             # Frontend logic
│   └── package.json        # Node dependencies
└── Dataset/
    └── for-2sec/for-2seconds/
        ├── training/       # Training data
        ├── testing/        # Test data
        └── validation/     # Validation data
```
- Create a virtual environment:

  ```shell
  cd backend
  python -m venv .venv
  .venv\Scripts\activate  # Windows
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Install Node.js dependencies:

  ```shell
  cd frontend
  npm install
  ```

Train the model on your dataset:

```shell
cd backend
python training.py
```

This will:

- Load the dataset from `H:\FAC\Dataset`
- Train using dual GPUs (RTX 5060 Ti) with DataParallel
- Apply 5 augmentation techniques during training
- Save the best model to `checkpoints/best_model.pth`
- Generate validation graphs in `results/`
Training takes approximately 2-3 hours.
```shell
cd backend
python server.py
```

The Flask API will start on http://localhost:5000

In a separate terminal:

```shell
cd frontend
npm run dev
```

The frontend will start on http://localhost:5173. Open http://localhost:5173 in your browser.
Upload Mode:
- Drag & drop an audio file or click to browse
- Supported formats: MP3, WAV, OGG
- Get instant analysis results
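Besides the browser UI, the API can be scripted. Below is a minimal client sketch using only the Python standard library; the endpoint path (`/analyze`) is an assumption for illustration, so check `server.py` for the actual route names.

```python
import base64
import json
import urllib.request

API_BASE_URL = "http://localhost:5000"

def build_payload(audio_bytes: bytes) -> dict:
    """Wrap raw audio bytes in the base64 JSON envelope the API expects."""
    return {"audio": base64.b64encode(audio_bytes).decode("ascii")}

def analyze(wav_path: str) -> dict:
    """POST a WAV file to the analysis endpoint and parse the JSON reply."""
    with open(wav_path, "rb") as f:
        body = json.dumps(build_payload(f.read())).encode("utf-8")
    req = urllib.request.Request(
        f"{API_BASE_URL}/analyze",  # hypothetical route; see server.py
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```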
Check whether the model is loaded and retrieve accuracy metrics.

Response:

```json
{
  "loaded": true,
  "device": "cuda",
  "accuracy": 0.93,
  "version": "1.0"
}
```

Upload an audio file for analysis.
Request: `multipart/form-data` with the audio file.

Response:

```json
{
  "fakeness_score": 0.87,
  "prediction": "fake",
  "confidence": 0.87
}
```

Accepts base64-encoded audio for analysis.
Request:

```json
{
  "audio": "base64_encoded_audio_data"
}
```

Response:

```json
{
  "fakeness_score": 0.23,
  "prediction": "real",
  "confidence": 0.77
}
```

- Time Masking - Masks random time frames
- Frequency Masking - Masks random mel bins
- Gaussian Noise - Adds random noise
- Codec Compression - Simulates MP3/AAC artifacts
- Pitch Shift - Random pitch changes (±2 semitones)
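Three of the five techniques can be sketched directly on the spectrogram array (NumPy shown for brevity; the mask widths and noise level here are illustrative assumptions, and the authoritative versions live in `backend/augmentation.py`):

```python
import numpy as np

def time_mask(spec, max_width=20, rng=None):
    """Zero out a random contiguous block of time frames (SpecAugment-style)."""
    rng = np.random.default_rng() if rng is None else rng
    spec = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, spec.shape[1] - width + 1))
    spec[:, start:start + width] = 0.0
    return spec

def freq_mask(spec, max_width=15, rng=None):
    """Zero out a random contiguous band of mel bins."""
    rng = np.random.default_rng() if rng is None else rng
    spec = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, spec.shape[0] - width + 1))
    spec[start:start + width, :] = 0.0
    return spec

def gaussian_noise(spec, std=0.05, rng=None):
    """Add zero-mean Gaussian noise to every bin."""
    rng = np.random.default_rng() if rng is None else rng
    return spec + rng.normal(0.0, std, size=spec.shape)
```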
- Automatic detection of available GPUs
- DataParallel for batch splitting across GPUs
- Optimized batch size (64 total, 32 per GPU)
- ~2x training speedup with dual GPUs
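The detection-and-wrapping step can be sketched as below. A stand-in module is used so the snippet runs anywhere; on a single-GPU or CPU machine it simply falls back to the unwrapped model.

```python
import torch
import torch.nn as nn

def wrap_for_gpus(model: nn.Module) -> nn.Module:
    """Move the model to CUDA and let DataParallel split each batch
    across all visible GPUs (e.g. a batch of 64 becomes 32 per GPU)."""
    if torch.cuda.is_available():
        model = model.cuda()
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model)
    return model

model = wrap_for_gpus(nn.Linear(4, 1))  # stand-in for the real CNN
```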
After training, comprehensive validation generates:
- Training History - Loss and accuracy curves
- ROC Curve - With AUC score
- Confusion Matrix - True/False positives/negatives
- Probability Distribution - Model confidence visualization
- Precision-Recall Curve - Performance across thresholds
All graphs are saved to `backend/results/`.
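For reference, the confusion-matrix counts and F1 score can be derived from the model's fakeness scores like this (a minimal sketch; the project's validation step additionally renders the graphs listed above):

```python
import numpy as np

def confusion_and_f1(scores, labels, threshold=0.5):
    """scores: fakeness in [0, 1]; labels: 1 = fake, 0 = real."""
    preds = (np.asarray(scores) >= threshold).astype(int)
    labels = np.asarray(labels)
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    tn = int(((preds == 0) & (labels == 0)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn, "f1": f1}
```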
- Sample rate: 16kHz
- FFT size: 1024
- Hop length: 256
- Mel bins: 128
- Normalization: Per-sample z-score
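The normalization step, plus forcing the time axis to the fixed 200 frames the model expects, can be sketched as below. The pad-or-crop behavior is an assumption about how off-length clips are handled; `backend/preprocessing.py` is authoritative.

```python
import numpy as np

def normalize(spec: np.ndarray) -> np.ndarray:
    """Per-sample z-score: zero mean, unit variance for each spectrogram."""
    return (spec - spec.mean()) / (spec.std() + 1e-8)

def pad_or_crop(spec: np.ndarray, n_frames: int = 200) -> np.ndarray:
    """Force the time axis to exactly n_frames (zero-pad or crop)."""
    _, t = spec.shape
    if t < n_frames:
        spec = np.pad(spec, ((0, 0), (0, n_frames - t)))
    return spec[:, :n_frames]
```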
- Loss: BCELoss
- Optimizer: AdamW (lr=3e-4)
- Scheduler: ReduceLROnPlateau (patience=3)
- Batch size: 64 (32 per GPU)
- Epochs: 30-40
- Early stopping: patience=5
- Checkpointing: Best F1 score
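The configuration above maps onto PyTorch objects roughly as follows (a sketch with a stand-in model; the full loop with early stopping and F1-based checkpointing is in `backend/training.py`):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 1), nn.Sigmoid())  # stand-in for the CNN
criterion = nn.BCELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", patience=3)  # stepped with the validation metric

# one illustrative training step
x, y = torch.randn(4, 8), torch.randint(0, 2, (4, 1)).float()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
scheduler.step(0.5)  # pass the epoch's validation F1
```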
```
Input [1, 128, 200]
        ↓
Conv Block 1: 1→32 filters
        ↓
Conv Block 2: 32→64 filters
        ↓
Conv Block 3: 64→128 filters
        ↓
Conv Block 4: 128→256 filters
        ↓
AdaptiveAvgPool (1x1)
        ↓
FC: 256→128 + Dropout(0.3)
        ↓
FC: 128→1 + Sigmoid
        ↓
Output: Fakeness score [0-1]
```
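The diagram above translates into a PyTorch module along these lines (kernel sizes, pooling, and BatchNorm placement are assumptions; `backend/model.py` is the canonical definition):

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Conv -> BatchNorm -> ReLU -> 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class FakeAudioCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 32),
            conv_block(32, 64),
            conv_block(64, 128),
            conv_block(128, 256),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, 1, 128, 200)
        return self.classifier(self.features(x))
```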
- Ensure you've trained the model first: `python training.py`
- Check that `checkpoints/best_model.pth` exists
- Reduce the batch size in `training.py` (default: 64)
- Reduce `num_workers` in the DataLoader (default: 8)
- Ensure the Flask server is running on port 5000
- Check the CORS settings in `server.py`
- Verify `API_BASE_URL` in `main.js`
See LICENSE file for details.
- Dataset: Fake-or-Real Audio Dataset
- Framework: PyTorch
- Frontend: Vanilla JavaScript with Vite
- Backend: Flask
Status: ✅ Complete | Training F1: 99.79% (validation) | Test F1: 39.8% | Demonstrates: Generalization challenges in deepfake detection
An honest, AI-assisted research project that:
- ✅ Achieves 99.79% F1 on training distribution
- ✅ Demonstrates complete ML pipeline (data → training → deployment)
- ✅ Reveals real-world generalization challenges
- ✅ Provides comprehensive analysis and visualizations
- ✅ Includes honest assessment of limitations
"I did what I could. This is an AI-assisted project, but it's an honest one because I did the work of learning, structuring, and lecturing myself on its logic"
- 📁 File upload with drag & drop support
- 📊 Complete training analysis with 5 visualization graphs
- 🔬 Technical documentation with architecture details
- ⚡ Fast inference (<100ms per clip on GPU)
- 🎯 Honest disclaimers about limitations
For complete documentation, see PROJECT_SUMMARY.md