Can machine learning identify the appropriate reading level of a passage of text and help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages offering the right level of challenge, they naturally develop reading skills. Currently, most educational texts are matched to readers using traditional readability methods or commercially available formulas, but each has its issues. Tools like Flesch-Kincaid Grade Level are based on weak proxies of text decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence). As a result, they lack construct and theoretical validity. At the same time, commercially available formulas, such as Lexile, can be cost-prohibitive, lack suitable validation studies, and suffer from transparency issues when the formula's features aren't publicly available.
Data
The dataset consisted of Train (12,000 rows), Test (1,200 rows), a Sample_Submission file, and a Nigerian_State_LGA_Name lookup file.
Metrics
F1 score was used to evaluate the algorithm.
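As a quick illustration of the metric, here is how the F1 score is computed with scikit-learn; the labels below are toy values, not the competition data:

```python
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions for illustration only.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# F1 = 2 * precision * recall / (precision + recall)
score = f1_score(y_true, y_pred)
print(round(score, 3))  # 0.857
```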
ML Task
Binary classification task.
Problems
id - unique ID for the excerpt.
url_legal - URL of the source; blank in the test set.
license - license of the source material; blank in the test set.
excerpt - text to predict the reading ease of.
target - reading ease.
standard_error - measure of the spread of scores among multiple raters for each excerpt; not included for the test data.
Solved
Used RandomOverSampler algorithm to oversample the minority class.
I tried imputing NaNs with IterativeImputer and KNNImputer.
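A sketch of both imputers on a toy numeric frame (using an ExtraTreesRegressor as the IterativeImputer estimator, in the spirit of the Algorithms Used section; the data here is illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.ensemble import ExtraTreesRegressor

# Toy frame with missing values; not the competition data.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [10.0, np.nan, 30.0, 40.0]})

it_imp = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0))
knn_imp = KNNImputer(n_neighbors=2)

df_iter = pd.DataFrame(it_imp.fit_transform(df), columns=df.columns)
df_knn = pd.DataFrame(knn_imp.fit_transform(df), columns=df.columns)

# Both imputers fill every NaN.
print(df_iter.isna().sum().sum(), df_knn.isna().sum().sum())  # 0 0
```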
I used the absolute value of Age to fix negative values.
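In pandas this fix is a one-liner; the Age values below are made up for illustration:

```python
import pandas as pd

# Toy Age column containing invalid negative entries.
df = pd.DataFrame({"Age": [25, -30, 41, -2]})

# Flip negative ages to their absolute value.
df["Age"] = df["Age"].abs()
print(df["Age"].tolist())  # [25, 30, 41, 2]
```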
When I deleted duplicated rows I got a lower F1 score on the public LB, so I did not fix them. But on the private LB I found out I should have deleted them.
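The deletion step itself is straightforward with pandas `drop_duplicates`; the frame below is a toy example:

```python
import pandas as pd

# Toy frame with one exact duplicate row.
df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})

# Keep the first occurrence of each duplicated row.
deduped = df.drop_duplicates().reset_index(drop=True)
print(len(deduped))  # 2
```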
Interestingly, I used the Nigerian_State_LGA_Name dataset to correct the names in the LGA and State columns.
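One way this correction can be sketched is a replace against a reference mapping; the misspellings and mapping below are hypothetical, since the actual contents of Nigerian_State_LGA_Name are not shown here:

```python
import pandas as pd

# Hypothetical misspelled State values (illustrative only).
train = pd.DataFrame({"State": ["Laggos", "Kano"]})

# Hypothetical lookup built from the Nigerian_State_LGA_Name file,
# mapping misspellings to canonical names.
reference = {"Laggos": "Lagos", "Kano": "Kano"}

train["State"] = train["State"].replace(reference)
print(train["State"].tolist())  # ['Lagos', 'Kano']
```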
Again, I did not fix duplicated rows that had different targets.
Unsolved
I did not pay attention to scaling, transforming, or feature selection, which led to overfitting.
Rather than following ML best practices, I followed what the public LB told me about duplicated rows.
I did not use stacking or boosting ensembles effectively.
Algorithms Used
CatBoost for binary classification.
IterativeImputer with ExtraTrees for imputing missing values, after label-encoding the categorical columns.
RandomOverSampler for oversampling the minority class.