Can machine learning identify the appropriate reading level of a passage of text and help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages offering the right level of challenge, they naturally develop reading skills. Currently, most educational texts are matched to readers using traditional readability methods or commercially available formulas, but each has its issues. Tools like Flesch-Kincaid Grade Level are based on weak proxies of text decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence). As a result, they lack construct and theoretical validity. At the same time, commercially available formulas, such as Lexile, can be cost-prohibitive, lack suitable validation studies, and suffer from transparency issues when the formula's features aren't publicly available.
Data
The dataset consisted of Train (12,000 rows), Test (1,200 rows), a Sample_Submission file, and a Nigerian_State_LGA_Name lookup file.
Metrics
F1 score was used to evaluate the algorithm.
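As a quick illustration of the metric, here is how the F1 score is computed with scikit-learn; the labels below are toy values, not the competition data:

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions for illustration only.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# F1 = 2 * precision * recall / (precision + recall)
score = f1_score(y_true, y_pred)
print(round(score, 3))  # 0.857
```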
ML Task
Binary classification task.
Problems
id - unique ID for the excerpt.
url_legal - URL of the source; blank in the test set.
license - license of the source material; blank in the test set.
excerpt - text to predict the reading ease of.
target - reading ease.
standard_error - measure of the spread of scores among multiple raters for each excerpt; not included for the test data.
Solved
Used RandomOverSampler algorithm to oversample the minority class.
I tried imputing NaNs with IterativeImputer and KNNImputer.
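A sketch of both imputers on a toy numeric frame (using an ExtraTreesRegressor as the IterativeImputer estimator, in the spirit of the Algorithms Used section; the data here is illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.ensemble import ExtraTreesRegressor

# Toy frame with missing values; not the competition data.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [10.0, np.nan, 30.0, 40.0]})

it_imp = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0))
knn_imp = KNNImputer(n_neighbors=2)

df_iter = pd.DataFrame(it_imp.fit_transform(df), columns=df.columns)
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)

# Both imputers fill every NaN.
print(df_iter.isna().sum().sum(), df_knn.isna().sum().sum())  # 0 0
```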
I used the absolute value of Age to fix negative values.
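In pandas this fix is a one-liner; the Age values below are made up for illustration:

```python
import pandas as pd

# Toy Age column containing invalid negative entries.
df = pd.DataFrame({"Age": [25, -30, 41, -2]})

# Flip negative ages to their absolute value.
df["Age"] = df["Age"].abs()
print(df["Age"].tolist())  # [25, 30, 41, 2]
```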
When I deleted duplicated rows I got a lower F1 score on the public LB, so I did not fix them. But on the private LB I found out I should have deleted them.
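The deletion step itself is straightforward with pandas `drop_duplicates`; the frame below is a toy example:

```python
import pandas as pd

# Toy frame with one exact duplicate row.
df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})

# Keep the first occurrence of each duplicated row.
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2
```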
Interestingly, I used the Nigerian_State_LGA_Name dataset to correct the names in the LGA and State columns.
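One way this correction can be sketched is a replace against a reference mapping; the misspellings and mapping below are hypothetical, since the actual contents of Nigerian_State_LGA_Name are not shown here:

```python
import pandas as pd

# Hypothetical misspelled State values (illustrative only).
train = pd.DataFrame({"State": ["Laggos", "Kano"]})

# Hypothetical lookup built from the Nigerian_State_LGA_Name file,
# mapping misspellings to canonical names.
reference = {"Laggos": "Lagos", "Kano": "Kano"}

train["State"] = train["State"].replace(reference)
print(train["State"].tolist())  # ['Lagos', 'Kano']
```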
Again, I did not fix duplicated rows that had different targets.
Unsolved
I did not pay attention to scaling, transforming, or feature selection, which led to overfitting.
Rather than following ML best practices, I followed what the public LB told me about duplicated rows.
I did not use stacking or boosting ensembles effectively.
Algorithms Used
CatBoost for binary classification.
IterativeImputer with ExtraTrees for imputing missing values, after label-encoding the categorical columns.
RandomOverSampler for oversampling the minority class.