-
Notifications
You must be signed in to change notification settings - Fork 5
Add LASSO-regularized propensity score (g-model) strategy (LassoCTMLE) #137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
@olivierlabayle While I was trying to revert the changes vscode did to the 56 files, I think I ended up deleting the first draft pull request since at that point my branch looked identical to the main branch so I started a new draft pull request for us to communicate and I would figure out not to let those changes happen again. |
|
No problem! Do you have an test idea? That is, a simulation setting in which you expect this Lasso-CTMLE to perform better than the canonical one? |
Yes, I actually have a setup I have been using ever since I started building the estimator; it's basically a simulation using the Toeplitz function so we can adjust or vary the correlation parameter. From my tests, the higher the correlation, rho, between the selection covariates, the better C-TMLE performs compared to TMLE. I will add the Julia converted code to the test/lasso_strategy.jl. Still going through how to actually test anything new I add to the package. |
|
In order. to test you simply use the estimator on your generated dataset, for the structure of the file, you can basically look at any other test file, for instance this is an easy one to read: https://github.com/TARGENE/TMLE.jl/blob/main/test/counterfactual_mean_based/non_regression_test.jl |
|
Hi @olivierlabayle , I tried to mirror the syntax used in I also tried calling the other strategies you created for the template much earlier ( Do you have any idea why these strategies aren’t visible, or what I might be missing? Thanks |
You need to push a test file so that I can see what you are doing, from your branch, this works perfectly fine:
The reason why your strategy is not visible is because you haven't included the source file in the package. If you include the file you will likely have an error when trying to import the package because you import other packages that are not in the Projrct.toml. As a reminder, I suggested reading the contribution guide here: https://targene.github.io/TMLE.jl/previews/PR135/contributing/. I linked a youtube video which is worth watching to understand the basics of Julia packages. |
I did watch the video, really detailed but I don't understand why ( |
|
There were quite a few problems:
As you can see the package is broken because GLMNet and MLJBase both define the |
|
I have added some documentation in order to setup your local environment: https://targene.github.io/TMLE.jl/previews/PR135/contributing/#Environment |
|
Hi @olivierlabayle , I’ve resolved my environment issues and some other syntax inconsistencies with my code, but now the LassoCTMLE test fails with: I’ve tried both Thanks. |
olivierlabayle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Asantewaa, looks like you are in a good position to make some progress on testing your implementation now!
| ) | ||
| ) | ||
| lasso_result, _ = lasso_estimator(Ψ, dataset; verbosity = 0) | ||
| @test !isnan(estimate(lasso_result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you notice any improvement using this estimator e.g., coverage ? bias ? variance ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still haven't passed the test; there seems to be an issue with how the data is being passed to the propensity score. I will take the package (Toeplitz) out of the main environment and add it to the test environment instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To reflect this you need to include the test/counterfactual_mean_based/lasso_strategy.jl in the test/runtest.jl file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve just done that, I’m still getting the same error which is basically a call on the wrong use of the propensity_score function you defined in the src/counterfactual_mean_based/collaborative_template.jl, I’m not sure what exactly I’m doing wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment the tests say that you need to install : LinearAlgebra first. You need to add it to the test dependencies again. I can't see the error otherwise but I am pretty sure you are not calling the propensity_score method with an appropriate object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't tell what is going on locally, perhaps restart Julia (if you haven't done so already) and make sure your installation is clean.
Regarding the CI problem, one thing that is directly obvious is that the treatmnet should be categorical as per this page: https://targene.github.io/TMLE.jl/stable/walk_through/#The-Dataset. So you should call categorical(A) in your simulation dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @olivierlabayle , so I hace done that but I'm still getting that error and also, there seem to be a problem with the propensity score that says : Got exception outside of a @test
Could not fit the following propensity score model: P₀(A | W1, W10, W2, W3, W4, W5, W6, W7, W8, W9).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to develop an understanding of the objects you are manipulating to solve your problems, for instance this cannot be correct: ĝ = ConditionalDistribution(g_fit, strategy.confounders) because a ConditionalDistribution is an estimand not an estimate. I suggest you write tests for each function you develop to make sure the outputs are what you expect. This will divide the big problem into smaller problems and guarantee correctness across the codebase. For example, crossvalidate_lambda, iterate for the LassoCTMLEIterator are functions that can and must be tested. You will realise there is likely quite a few problems in your code. The current tests for other strategies can be helpful: https://github.com/TARGENE/TMLE.jl/blob/main/test/counterfactual_mean_based/covariate_based_strategies.jl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright. Honestly, this entire integration has been a little overwhelming especially coming into the same errors so many times, but I will break everything down and try to get them right before putting them together. Also, is it best to still call the propensity_score function you wrote or I write my own version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal is for you to integrate as much as possible with the existing codebase so any function you can reuse you should. However the codebase is becoming quite large and I appreciate it might be difficult to get it correct right away. Start by coming up with something that works and then we can see together how to integrate it better within the existing code, step by step.
…correct Y calculation
|
Hi @olivierlabayle , I think the integration is working fine now. The code is a little messy, but it runs and all the tests pass. I also added an example in the If you could take a look when you have a chance, that’d be great. |
|
So I've got a bootstrap analysis working with some nice visualizations. The results look good. I did ran into was some dependency conflicts with ToeplitzMatrices.jl - it kept causing issues, so I just wrote a simple I am getting some package extension warnings during the environment switching (stuff like IntervalSets, Zygote extensions complaining), but everything actually works. I also made sure the verbose control system works well, so by default everything runs silently for production use, but you can flip |
olivierlabayle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Asantewaa,
Thanks for meeting today and good effort with the PR! I've left a few comments on the code as discussed today. Here are the more general notes I took as I was reviewing for completeness.
The R package for reference: https://github.com/jucheng1992/ctmle/tree/master
Implementation
- Initial estimator:
- Currently this is done with the user provided G model which could be anything —> problem, the lasso CTMLE must work only with a G model that is a glmnet (see point below)
- I think the initial G should be fitted with glmnetcv —> need to create a GLMNet MLJ compliant model, take inspiration / use the one in TMLECLI: https://github.com/TARGENE/TMLECLI.jl/blob/main/src/models/glmnet.jl
- Create a sequence of estimators: the sequence of lambdas in the previous step defines the sequence
- Evaluate on cv:
- I believe StepKPropensityScoreIterator should iterate only once and return a “fixed estimator” that returns the precomputed linear model from glmnetcv. The propensity score is the original propensity score.
- Exhausted returns true at the end of the sequence or if patience is reached
- This yields a lambda CTMLE and associated GLM / Qstar
- There seems to be a termination step which I can’t see in the current implementation. I am also not entirely clear what is the logic I’ve seen in the paper.
Tests
- Tests are too sparse, we need to test intermediate functions and logic
- How does it work with categorical treatments? What about multiple treatments?
Example/Docs
- The current comparison is with a standard estimator only using linear models (no cross validation). In order to disentangle the effect of model specification from the effect of C-TMLE, you need to provide a GLMNet to the standard estimator’s G model.
- Only code and documentation goes into the repo, hence temporary plots you make locally should not be committed to the repo. If you want to share some results with me you can add them to the PR or send them over Teams.
- In order to later integrate within the docs you don’t need to do
using PkgandPkg.activate. The docs environment is used in the docs and the TMLE packaged is added to it withdev. Look at the other examples for how to make your own.
| using Test | ||
| using TMLE | ||
|
|
||
| TEST_DIR = joinpath(pkgdir(TMLE), "test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert the changes to this file. The joinpath is required to make sure the tests runs regardless of the run directory.
| ``` | ||
| """ | ||
| mutable struct LassoCTMLE <: CollaborativeStrategy | ||
| confounders::Vector{Symbol} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does the strategy require confounders? Should these not simply be the estimand's confounders?
examples/lasso_example_old.jl
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this is old, it probably needs to be removed from the repo
| using AutoHashEquals | ||
| using StatisticalMeasures | ||
| using DataFrames | ||
| import GLMNet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will eventually need to make GLMNet a package extension. That is we only want to load the LassoCTMLE code when the use loads GLMNet and not have GLMnet as a direct TMLE.jl dependency. Do you think you could do that?
Some docs that can help:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can give it a try after dealing with the other comments and revisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you make this example compatible with and included in the docs? This is where you can add an example file to the docs. The file must respect the Literate.jl format, which is a plain script. Use comments to drive the narative of what the example is and what it shows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are welcome to use the other examples as a source of inspiration to build yours.
| lambda_path::Vector{Float64} | ||
| cv_folds::Int | ||
| alpha::Float64 | ||
| verbose::Bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed last time the verbosity level is not decided by the LassoCTMLE.
|
|
||
| function LassoCTMLE(; | ||
| patience = 5, | ||
| lambda_path = :cv, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are defined by the G glmnetcv procedure.
| Extract a vector of confounder symbols from the estimand `Ψ`. | ||
| Collects treatment-specific confounders (in order) and returns unique symbols. | ||
| """ | ||
| function extract_confounders_from_estimand(Ψ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, we will need more tests to convince ourselves and others that the code is doing what it is intended to do. As a guiding principle, think that almost all the functions you write should be tested if possible. For the statistical validity of the estimator, the documented example you have can suffice as it can be considered an end to end test.
| end | ||
|
|
||
| selected_vars = var_names[selected_indices] | ||
| log_info(strategy, "GLMNet: α=$alpha, λ=$lambda → $(length(selected_vars))/$(length(var_names)) variables selected") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is excessible logging in this PR which makes it difficult to navigate. Print statements that you use for your own debugging should be removed when you comit to keep the code as simple as possible.
| return unique(all) | ||
| end | ||
|
|
||
| function initialise!(strategy::LassoCTMLE, Ψ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
half of this function definition are logs. Remove.
|
Hi @olivierlabayle , I've implemented your suggestion to avoid refitting, using GLMNet CV's optimal lambda directly without iterating through multiple candidates. I had to add two small helpers ( Oh and I've cleaned up the The tests all pass. Still working on adapting the example file for the docs and I'm yet to look into making GLMNet a package extension, it's a lot to go through but I created the GLMNetExt.jl file[it's empty] just to start |


Summary
This draft PR introduces the collaborative strategy,
LassoCTMLE, to the TMLE.jl package.LassoCTMLEfits a LASSO-regularized logistic regression model (using GLMNet.jl) for the propensity score (g-model), with cross-validated lambda selection. The strategy is intended to be generic, supporting any parameter Ψ (not just ATE), and integrates with the existing TMLE.jl API, as advised.Motivation for this estimator
Implementation Details
Testing
test/folder I should follow?Open Questions
UndefVarError: TMLE not defined in Mainwhen I try to run the singular lasso_strategy script, just to check if everything is imported or written well. What's the natural way to import or use the package to test the code I write?LassoCTMLEidiomatic for TMLE.jl?Next Steps
Thank you for reviewing.