📝 Identify the Author - Kaggle Competition

This document outlines a compact version of the full strategy for the Kaggle "Identify the Author" competition. The task is to predict the author of short literary excerpts using machine learning and NLP techniques.


📌 Goal

Classify excerpts by one of three authors using a multi-class model evaluated by log-loss.


🚦 Overall Plan

  1. Data Loading: Use pandas to load train.csv and test.csv. Confirm structure and balance across authors.

  2. Text Preprocessing: Lowercase the text, tokenize using spaCy or NLTK, lemmatize, and retain stopwords as they may carry stylistic signals. Punctuation is removed for most features but counted separately.

  3. Feature Engineering: Use TF-IDF on word and character n-grams as the core feature set. Add sentence-level features like length, punctuation counts, and readability scores. Include dense vector features via GloVe and BERT sentence embeddings.

  4. Model Training: Train Logistic Regression on TF-IDF (strong baseline), Naive Bayes as a quick benchmark, XGBoost on dense features, a Bi-LSTM with GloVe embeddings, and a fine-tuned BERT for deep context.

  5. Evaluation: Use Stratified K-Fold cross-validation. Focus on log-loss. Apply probability calibration if needed and analyze misclassifications with confusion matrices.

  6. Ensembling: Blend Logistic Regression, BERT, and optionally the LSTM or XGBoost using weighted averaging or stacking. Choose the weights based on cross-validation log-loss.

  7. Submission: Output class probabilities for each test example. Ensure predictions are well-calibrated and formatted per sample_submission.csv.
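The core of steps 1–4 can be sketched with scikit-learn. This is a minimal illustration on toy data, not the competition pipeline: the excerpts, labels, and author codes below are stand-ins, and the real pipeline would load `train.csv` with pandas first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy stand-in for train.csv; these excerpts and author codes are
# illustrative, not the competition's actual data.
texts = ["the raven perched above my door",
         "it was a dark and stormy night",
         "the sea called to him once more",
         "nevermore said the raven again",
         "storm clouds gathered over the moor",
         "he sailed at dawn toward the horizon"]
labels = ["EAP", "MWS", "HPL", "EAP", "MWS", "HPL"]

# Word and character n-grams side by side, as described in step 3.
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

# Step 4's strong baseline: Logistic Regression on the TF-IDF features.
model = Pipeline([("tfidf", features),
                  ("clf", LogisticRegression(max_iter=1000))])
model.fit(texts, labels)

# predict_proba yields one probability per author, as log-loss requires.
probs = model.predict_proba(["the raven spoke"])
print(probs.shape)  # (1, 3)
```

The `char_wb` analyzer restricts character n-grams to word boundaries, which tends to capture stylistic habits (suffixes, punctuation patterns) without exploding the vocabulary.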


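Steps 6–7 reduce to a convex combination of per-model probabilities, written out to CSV. The probabilities below are randomly generated placeholders, and the column names are assumptions — the real layout must follow `sample_submission.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical test-set probabilities from two models (each row sums to 1).
rng = np.random.default_rng(0)
p_lr = rng.dirichlet(np.ones(3), size=4)    # logistic-regression probs
p_bert = rng.dirichlet(np.ones(3), size=4)  # transformer probs

# Weighted average; in practice the weight comes from cross-val log-loss.
w = 0.4
blend = w * p_lr + (1 - w) * p_bert

# Format like sample_submission.csv (id and column names are assumptions).
submission = pd.DataFrame(blend, columns=["EAP", "HPL", "MWS"])
submission.insert(0, "id", [f"id{i}" for i in range(len(blend))])
submission.to_csv("submission.csv", index=False)
```

Because each row of `blend` is a convex combination of valid probability vectors, it still sums to 1, so no renormalization is needed before submission.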
📁 Structure

```
project-root/
├── data/         # Raw CSV files
├── notebooks/    # EDA and modeling
├── src/          # Scripts
├── outputs/      # Submissions
├── README.md
└── requirements.txt
```

🏁 Summary

Start with strong baselines like TF-IDF + Logistic Regression. Enhance with stylometric features and contextual embeddings. Fine-tune BERT for state-of-the-art performance. Blend multiple models for improved log-loss. Stick to rigorous validation and calibrate probabilities.

This structured strategy balances classical NLP, modern transformers, and ensemble modeling to compete effectively in the challenge.

About

Authorship attribution model built for an internal ML competition at Endava. I fine-tuned a DeBERTa transformer and achieved 0.24 log-loss, ranking ~30 out of 300 participants. Focused on model generalization, data cleaning, and avoiding known data-leakage pitfalls.
