📝 Identify the Author - Kaggle Competition

This document outlines a compact version of the full strategy for the Kaggle "Identify the Author" competition. The task is to predict the author of short literary excerpts using machine learning and NLP techniques.


📌 Goal

Classify excerpts by one of three authors using a multi-class model evaluated by log-loss.


🚦 Overall Plan

  1. Data Loading: Use pandas to load train.csv and test.csv. Confirm structure and balance across authors.

  2. Text Preprocessing: Lowercase the text, tokenize using spaCy or NLTK, lemmatize, and retain stopwords as they may carry stylistic signals. Punctuation is removed for most features but counted separately.

  3. Feature Engineering: Use TF-IDF on word and character n-grams as the core feature set. Add sentence-level features like length, punctuation counts, and readability scores. Include dense vector features via GloVe and BERT sentence embeddings.

  4. Model Training: Train Logistic Regression on TF-IDF (strong baseline), Naive Bayes as a quick benchmark, XGBoost on dense features, a Bi-LSTM with GloVe embeddings, and a fine-tuned BERT for deep context.

  5. Evaluation: Use Stratified K-Fold cross-validation. Focus on log-loss. Apply probability calibration if needed and analyze misclassifications with confusion matrices.

  6. Ensembling: Blend Logistic Regression, BERT, and optionally the LSTM or XGBoost using weighted averaging or stacking. Choose the weights based on cross-validation log-loss.

  7. Submission: Output class probabilities for each test example. Ensure predictions are well-calibrated and formatted per sample_submission.csv.
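The core of steps 1–4 can be sketched with scikit-learn. This is a minimal illustration on toy data, not the competition pipeline: the excerpts, labels, and author codes below are stand-ins, and the real pipeline would load `train.csv` with pandas first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy stand-in for train.csv; these excerpts and author codes are
# illustrative, not the competition's actual data.
texts = ["the raven perched above my door",
         "it was a dark and stormy night",
         "the sea called to him once more",
         "nevermore said the raven again",
         "storm clouds gathered over the moor",
         "he sailed at dawn toward the horizon"]
labels = ["EAP", "MWS", "HPL", "EAP", "MWS", "HPL"]

# Word and character n-grams side by side, as described in step 3.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

# Step 4's strong baseline: Logistic Regression on the TF-IDF features.
model = Pipeline([("tfidf", features),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(texts, labels)

# predict_proba yields one probability per author, as log-loss requires.
probs = model.predict_proba(["the raven spoke"])
print(probs.shape)  # (1, 3)
```

The `char_wb` analyzer restricts character n-grams to word boundaries, which tends to capture stylistic habits (suffixes, punctuation patterns) without exploding the vocabulary.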


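Steps 6–7 reduce to a convex combination of per-model probabilities, written out to CSV. The probabilities below are randomly generated placeholders, and the column names are assumptions — the real layout must follow `sample_submission.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical test-set probabilities from two models (each row sums to 1).
rng = np.random.default_rng(0)
p_lr = rng.dirichlet(np.ones(3), size=4)    # logistic-regression probs
p_bert = rng.dirichlet(np.ones(3), size=4)  # transformer probs

# Weighted average; in practice the weight comes from cross-val log-loss.
w = 0.4
blend = w * p_lr + (1 - w) * p_bert

# Format like sample_submission.csv (id and column names are assumptions).
submission = pd.DataFrame(blend, columns=["EAP", "HPL", "MWS"])
submission.insert(0, "id", [f"id{i}" for i in range(len(blend))])
submission.to_csv("submission.csv", index=False)
```

Because each row of `blend` is a convex combination of valid probability vectors, it still sums to 1, so no renormalization is needed before submission.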
📁 Structure

```
project-root/
├── data/         # Raw CSV files
├── notebooks/    # EDA and modeling
├── src/          # Scripts
├── outputs/      # Submissions
├── README.md
└── requirements.txt
```

🏁 Summary

Start with strong baselines like TF-IDF + Logistic Regression. Enhance with stylometric features and contextual embeddings. Fine-tune BERT for state-of-the-art performance. Blend multiple models for improved log-loss. Stick to rigorous validation and calibrate probabilities.

This structured strategy balances classical NLP, modern transformers, and ensemble modeling to compete effectively in the challenge.

About

Authorship attribution model built for an internal ML competition at Endava. I fine-tuned a DeBERTa transformer and achieved 0.24 log-loss, ranking ~30 out of 300 participants. Focused on model generalization, data cleaning, and avoiding known data-leakage pitfalls.
