This document outlines a compact version of the full strategy for the Kaggle "Identify the Author" competition. The task is to predict the author of short literary excerpts using machine learning and NLP techniques.
The goal is to classify each excerpt as the work of one of three authors, using a multi-class model evaluated by log-loss.
- Data Loading: Use pandas to load `train.csv` and `test.csv`. Confirm the structure and the class balance across authors.
- Text Preprocessing: Lowercase the text, tokenize with spaCy or NLTK, and lemmatize. Retain stopwords, since they can carry stylistic signal. Remove punctuation for most features, but count it separately as a stylometric feature.
- Feature Engineering: Use TF-IDF over word and character n-grams as the core feature set. Add sentence-level features such as length, punctuation counts, and readability scores. Include dense vector features via GloVe and BERT sentence embeddings.
- Model Training: Train Logistic Regression on TF-IDF (a strong baseline), Naive Bayes as a quick benchmark, XGBoost on the dense features, a Bi-LSTM with GloVe embeddings, and a fine-tuned BERT for deep context.
- Evaluation: Use stratified K-fold cross-validation and focus on log-loss. Apply probability calibration if needed, and analyze misclassifications with confusion matrices.
- Ensembling: Blend Logistic Regression, BERT, and optionally the LSTM or XGBoost via weighted averaging or stacking. Choose blend weights based on cross-validation log-loss.
- Submission: Output class probabilities for each test example. Ensure the predictions are well calibrated and formatted per `sample_submission.csv`.
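The data-loading step above can be sketched as follows. The `text` and `author` column names are assumptions based on typical Kaggle layouts; the inline `StringIO` sample stands in for the real `data/train.csv`:

```python
import io
import pandas as pd

def class_balance(df, label_col="author"):
    """Return per-author counts and proportions to check class balance."""
    counts = df[label_col].value_counts()
    return counts, counts / counts.sum()

# In the competition, replace the StringIO sample with:
#   train = pd.read_csv("data/train.csv")
sample = io.StringIO("id,text,author\n1,First line,A\n2,Second line,B\n3,Third line,A\n")
train = pd.read_csv(sample)
counts, props = class_balance(train)
```

A strongly imbalanced `props` would argue for stratified splits everywhere downstream.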
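The sentence-level stylometric features mentioned above (length and punctuation counts) need nothing beyond the standard library. A minimal sketch, with feature names chosen here for illustration:

```python
import string

def stylometric_features(text):
    """Simple sentence-level style features: lengths and punctuation counts."""
    tokens = text.split()
    punct = sum(ch in string.punctuation for ch in text)
    avg_len = (
        sum(len(t.strip(string.punctuation)) for t in tokens) / len(tokens)
        if tokens else 0.0
    )
    return {
        "n_tokens": len(tokens),
        "n_chars": len(text),
        "punct_count": punct,
        "avg_token_len": avg_len,
    }

feats = stylometric_features("It was a dark, stormy night!")
```

Readability scores could be added on top with a library such as `textstat`.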
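The TF-IDF + Logistic Regression baseline is a few lines with scikit-learn. This sketch uses word 1-2 grams only (character n-grams could be added with a `FeatureUnion`), and the toy texts and labels `A`/`B`/`C` are placeholders, not the competition's actual authors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Word-level TF-IDF feeding a multi-class logistic regression.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)

texts = ["the raven nevermore", "whale and the sea", "monster in the lab",
         "nevermore said the raven", "the sea and the whale", "the lab monster"]
labels = ["A", "B", "C", "A", "B", "C"]

baseline.fit(texts, labels)
probs = baseline.predict_proba(texts)  # class probabilities, as log-loss requires
```

Submitting `predict_proba` output rather than hard labels is what makes the log-loss metric meaningful.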
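The evaluation step (stratified K-fold with out-of-fold log-loss) might look like the sketch below; the pipeline, fold count, and toy data are illustrative choices, not the tuned setup:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline

def cv_log_loss(texts, labels, n_splits=3, seed=0):
    """Out-of-fold log-loss for a TF-IDF + Logistic Regression pipeline."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros((len(texts), len(np.unique(labels))))
    for tr, va in skf.split(texts, labels):
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts[tr], labels[tr])
        oof[va] = model.predict_proba(texts[va])  # columns follow sorted labels
    return log_loss(labels, oof)

texts = [
    "the raven tapped at my chamber door", "nevermore quoth the raven",
    "a dream within a dream", "the whale breached off the starboard bow",
    "call me ishmael said the sailor", "the sea was grey and endless",
    "the creature opened its yellow eye", "lightning animated the lifeless frame",
    "the monster fled across the ice",
]
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
score = cv_log_loss(texts, labels)
```

The same out-of-fold probability matrix can later be reused to fit ensemble weights or a stacking meta-model without leakage.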
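Weighted-average blending, as described in the ensembling step, reduces to a convex combination of each model's probability matrix. A minimal sketch; the two matrices and the 0.4/0.6 weights are made-up stand-ins for the real models' cross-validated outputs:

```python
import numpy as np

def blend(prob_list, weights):
    """Weighted average of per-model probability matrices, renormalized per row."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # make the weights sum to 1
    blended = sum(w * np.asarray(p) for w, p in zip(weights, prob_list))
    return blended / blended.sum(axis=1, keepdims=True)

p_lr = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])    # e.g. Logistic Regression
p_bert = np.array([[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]])  # e.g. fine-tuned BERT
p_final = blend([p_lr, p_bert], weights=[0.4, 0.6])    # weights chosen by CV log-loss
```

In practice the weights would be picked by a small grid search minimizing out-of-fold log-loss.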
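For the submission step, one way to format the probabilities is sketched below. The `id` and class column names are assumptions and must be checked against the actual `sample_submission.csv`; the clipping guards log-loss against exact 0/1 probabilities:

```python
import numpy as np
import pandas as pd

def make_submission(ids, probs, class_names, path=None):
    """Build a submission frame: one probability column per author, rows summing to 1."""
    probs = np.clip(probs, 1e-15, 1 - 1e-15)           # avoid infinite log-loss terms
    probs = probs / probs.sum(axis=1, keepdims=True)   # renormalize after clipping
    sub = pd.DataFrame(probs, columns=class_names)
    sub.insert(0, "id", ids)
    if path:
        sub.to_csv(path, index=False)                  # e.g. outputs/submission.csv
    return sub

sub = make_submission(["id1", "id2"],
                      np.array([[0.8, 0.15, 0.05], [0.1, 0.1, 0.8]]),
                      ["A", "B", "C"])
```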
```
project-root/
├── data/             # Raw CSV files
├── notebooks/        # EDA and modeling
├── src/              # Scripts
├── outputs/          # Submissions
├── README.md
└── requirements.txt
```
Start with strong baselines like TF-IDF + Logistic Regression. Enhance with stylometric features and contextual embeddings. Fine-tune BERT for state-of-the-art performance. Blend multiple models for improved log-loss. Stick to rigorous validation and calibrate probabilities.
This structured strategy balances classical NLP, modern transformers, and ensemble modeling to compete effectively in the challenge.