New Workflow: LDA then XGBoost #155
base: master
Conversation
New workflow which first runs LDA and then runs XGBoost, using the LDA results as the main score. This helps prevent overfitting with XGBoost; the results are pretty comparable to XGBoost alone.
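For readers unfamiliar with the idea, here is a minimal, self-contained sketch of the two-stage scheme on synthetic data, using scikit-learn and xgboost directly; it illustrates the concept only and is not the code in this PR.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier

# Synthetic stand-in for a feature table of sub-scores with target/decoy labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Stage 1: LDA collapses the sub-scores into a single discriminant score.
lda = LinearDiscriminantAnalysis()
lda_score = lda.fit(X, y).decision_function(X).reshape(-1, 1)

# Stage 2: XGBoost is trained on the sub-scores plus the LDA score as an
# additional (main) feature; its probability becomes the final score.
X_aug = np.hstack([lda_score, X])
xgb = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
xgb.fit(X_aug, y)
final_score = xgb.predict_proba(X_aug)[:, 1]
```

The intended effect, per the description above, is that the linear discriminant score anchors the second stage, so the boosted model stays close to plain XGBoost while being less prone to overfitting.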
singjc
left a comment
Thanks for the addition! I added some comments/suggestions. I am wondering/concerned about how the multi learner does with overfitting, because it seems to basically perform the learning and scoring twice on the same data for ss num iters and xval num iters. Can you add an example output of running LDA and XGBoost vs. LDA_XGBoost, and show the score distributions and pp plots, if it's not too much work?
I was also thinking we should make PyProphetMultiLearner abstract, so that we can open it up to different kinds of combinations for multi-stage learners.
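As a rough illustration of the abstraction being discussed, a base class could look like the sketch below; the class names `PyProphetMultiLearner` and `LDA_XGBoostMultiLearner` appear in the PR's file summary, but the method names and signatures here are hypothetical.

```python
from abc import ABC, abstractmethod


class PyProphetMultiLearner(ABC):
    """Base class for workflows that chain several semi-supervised learners."""

    @abstractmethod
    def run_stages(self, table):
        """Run each learning stage in order and return the final result."""


class LDA_XGBoostMultiLearner(PyProphetMultiLearner):
    """LDA first, then XGBoost using the LDA score as the main score."""

    def run_stages(self, table):
        lda_result = self._run_lda(table)             # stage 1
        return self._run_xgboost(table, lda_result)   # stage 2

    def _run_lda(self, table):
        ...  # delegate to the existing LDA learner

    def _run_xgboost(self, table, lda_result):
        ...  # augment the table with the LDA score and run XGBoost
```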
pyprophet/scoring/runner.py
Outdated
```python
# remove columns that are not needed for LDA
table_lda = self.table.drop(columns=["var_precursor_charge", "var_product_charge", "var_transition_count"], errors='ignore')

(result_lda, scorer_lda, weights_lda) = PyProphet(config_lda).learn_and_apply(table_lda)
```
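For context, a second pass along the lines this snippet implies could look roughly like the continuation below; it reuses the names from the snippet, but `config_xgb`, the helper `lda_scores_from`, and the column name `main_var_lda_score` are purely illustrative assumptions, not the PR's actual code.

```python
# Hypothetical continuation: attach the LDA discriminant score to the table
# as the main score column and rerun learning with an XGBoost configuration.
table_xgb = self.table.copy()
table_xgb["main_var_lda_score"] = lda_scores_from(result_lda)  # assumed helper

(result_xgb, scorer_xgb, weights_xgb) = PyProphet(config_xgb).learn_and_apply(table_xgb)
```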
Will this run the full learning and scoring for ss num iters and xval num iters, and then do a second pass with XGBoost on the same data for another ss num iters and xval num iters? I am wondering if this results in any overfitting.
I don't think it overfits; however, it might be unnecessary to do that many iterations.
Doing eFDR/FDR identification curves with different workflows, we can see that LDA_XGBoost is quite similar to XGBoost in terms of overfitting, and actually overfits slightly less than XGBoost; all results look reasonable.
Here are PyProphet reports for different classifiers for a diaPASEF single injection with an experimental library.
Co-authored-by: Justin Sing <32938975+singjc@users.noreply.github.com>
Pull Request Overview
This PR introduces a new hybrid workflow that combines Linear Discriminant Analysis (LDA) and XGBoost classifiers to improve scoring performance and prevent pi0 errors commonly encountered with standalone XGBoost.
Key changes include:
- Addition of `LDA_XGBoost` as a new classifier option that runs LDA first, then uses LDA scores as input to XGBoost (see the sketch after this list)
- Implementation of a new multi-classifier learning framework with abstract base classes
- Test coverage for the new workflow including regression test outputs
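For illustration, exposing such a classifier choice in a click-based CLI might look like the sketch below; the option name, the exact set of choices, and the dispatch are assumptions for this example, not the actual contents of pyprophet/cli/score.py.

```python
import click


@click.command()
@click.option(
    "--classifier",
    type=click.Choice(["LDA", "XGBoost", "LDA_XGBoost"]),
    default="LDA",
    show_default=True,
    help="Semi-supervised learner; LDA_XGBoost runs LDA first and feeds "
         "its discriminant score into XGBoost.",
)
def score(classifier):
    # Illustrative dispatch only; the real wiring lives in the runner module.
    click.echo(f"Selected classifier: {classifier}")


if __name__ == "__main__":
    score()
```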
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pyprophet/cli/score.py | Adds LDA_XGBoost to classifier choices and implements execution logic |
| pyprophet/_config.py | Updates type annotations and documentation for new classifier option |
| pyprophet/scoring/runner.py | Implements PyProphetMultiLearner base class and LDA_XGBoostMultiLearner |
| pyprophet/io/_base.py | Extends XGBoost-specific feature handling to include LDA_XGBoost |
| tests/test_pyprophet_score.py | Adds test case for the new LDA_XGBoost workflow |
| tests/_regtest_outputs/test_pyprophet_score.test_osw_11.out | Expected test output for LDA_XGBoost test case |

New workflow which first runs LDA and then runs XGBoost, using the LDA results as the main score. This helps prevent the pi0 errors that we run into with XGBoost.
Overall, the results seem quite comparable to just running XGBoost on my dataset.