[Question] Imbalanced data for classifiers in classification tasks #79

Thank you for all your hard work. 

I've noticed that in `ClassificationEvaluator`, classifiers such as `LogisticRegression` and `kNN` appear to be [trained on the entire training dataset](https://github.com/sbintuitions/JMTEB/blob/fd5d438911fc29ed0d9240601668fcc5bcc316ba/src/jmteb/evaluators/classification/evaluator.py#L100), even for extremely imbalanced data such as the `amazon_counterfactual` dataset, where 90% of the labels are `0` ([stats-ja](https://huggingface.co/datasets/mteb/amazon_counterfactual/viewer/ja/train)). The original MTEB addresses this by [undersampling](https://github.com/embeddings-benchmark/mteb/blob/eece6ecdb3248b1fb5a044d33c904f3d48cd6eab/mteb/abstasks/AbsTaskClassification.py#L138) the training dataset to obtain a balanced label distribution before fitting `LogisticRegression`.
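To make the idea concrete, here is a rough sketch of per-label undersampling. It is not the actual MTEB or JMTEB code: the function name `undersample`, the `samples_per_label` parameter, and the toy data are my own assumptions for illustration.

```python
import random
from collections import defaultdict


def undersample(texts, labels, samples_per_label, seed=42):
    """Keep at most `samples_per_label` randomly chosen examples per label.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)  # visit the examples in random order

    kept = defaultdict(int)  # number of examples kept so far, per label
    sampled_texts, sampled_labels = [], []
    for i in indices:
        if kept[labels[i]] < samples_per_label:
            sampled_texts.append(texts[i])
            sampled_labels.append(labels[i])
            kept[labels[i]] += 1
    return sampled_texts, sampled_labels


# Toy data with a 90/10 label split (made up, not the real dataset).
labels = [0] * 90 + [1] * 10
texts = [f"sentence {i}" for i in range(100)]

sub_texts, sub_labels = undersample(texts, labels, samples_per_label=8)
```

The classifier would then be fitted on the embeddings of `sub_texts` only, so that each label contributes the same number of training examples.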

Could you elaborate on your design choice for training on the entire dataset? Are there specific reasons for this approach? If I am missing something, feel free to correct me. Thank you!
