Skip to content

Conversation

@Asantewaah
Copy link
Collaborator

Summary

This draft PR introduces the collaborative strategy, LassoCTMLE, to the TMLE.jl package.
LassoCTMLE fits a LASSO-regularized logistic regression model (using GLMNet.jl) for the propensity score (g-model), with cross-validated lambda selection. The strategy is intended to be generic, supporting any parameter Ψ (not just ATE), and integrates with the existing TMLE.jl API, as advised.

Motivation for this estimator

  • Enables flexible, regularized estimation of the g-model, which can improve robustness in high-dimensional or collinear settings.
  • Helps users avoid overfitting and select an appropriate level of regularization via cross-validation.

Implementation Details

  • It uses cross-validation for lambda selection.

Testing

  • I would appreciate guidance on best practices for testing strategies in TMLE.jl.
  • Are there recommended datasets or paradigms in the test/ folder I should follow?
  • What are the edge cases or parameterizations I should cover?
  • Should I benchmark against existing strategies, and if so, with which datasets?

Open Questions

  • I currently get UndefVarError: TMLE not defined in Main when I try to run the singular lasso_strategy script, just to check if everything is imported or written well. What's the natural way to import or use the package to test the code I write?
  • I'm not sure if I succeeded in making the current constructor/API for LassoCTMLE idiomatic for TMLE.jl?

Next Steps

  • Integrate feedback from Olivier.
  • Add more tests and documentation as needed.

Thank you for reviewing.

@Asantewaah
Copy link
Collaborator Author

@olivierlabayle While I was trying to revert the changes vscode did to the 56 files, I think I ended up deleting the first draft pull request since at that point my branch looked identical to the main branch so I started a new draft pull request for us to communicate and I would figure out not to let those changes happen again.

@olivierlabayle
Copy link
Member

No problem! Do you have an test idea? That is, a simulation setting in which you expect this Lasso-CTMLE to perform better than the canonical one?

@Asantewaah
Copy link
Collaborator Author

No problem! Do you have an test idea? That is, a simulation setting in which you expect this Lasso-CTMLE to perform better than the canonical one?

Yes, I actually have a setup I have been using ever since I started building the estimator; it's basically a simulation using the Toeplitz function so we can adjust or vary the correlation parameter. From my tests, the higher the correlation, rho, between the selection covariates, the better C-TMLE performs compared to TMLE. I will add the Julia converted code to the test/lasso_strategy.jl. Still going through how to actually test anything new I add to the package.

@olivierlabayle
Copy link
Member

In order. to test you simply use the estimator on your generated dataset, for the structure of the file, you can basically look at any other test file, for instance this is an easy one to read: https://github.com/TARGENE/TMLE.jl/blob/main/test/counterfactual_mean_based/non_regression_test.jl

@Asantewaah
Copy link
Collaborator Author

Hi @olivierlabayle ,

I tried to mirror the syntax used in non_regression_test.jl for my LassoCTMLE test, as suggested. However, I keep getting this recurring error:

ERROR: UndefVarError: `LassoCTMLE` not defined in `Main`

I also tried calling the other strategies you created for the template much earlier (AdaptiveCorrelationStrategy, GreedyStrategy), and they’re giving the same error, even though they are exported and their files are included in TMLE.jl.

Do you have any idea why these strategies aren’t visible, or what I might be missing?

Thanks

@olivierlabayle
Copy link
Member

Hi @olivierlabayle ,

I tried to mirror the syntax used in non_regression_test.jl for my LassoCTMLE test, as suggested. However, I keep getting this recurring error:

ERROR: UndefVarError: `LassoCTMLE` not defined in `Main`

I also tried calling the other strategies you created for the template much earlier (AdaptiveCorrelationStrategy, GreedyStrategy), and they’re giving the same error, even though they are exported and their files are included in TMLE.jl.

Do you have any idea why these strategies aren’t visible, or what I might be missing?

Thanks

You need to push a test file so that I can see what you are doing, from your branch, this works perfectly fine:

image

The reason why your strategy is not visible is because you haven't included the source file in the package. If you include the file you will likely have an error when trying to import the package because you import other packages that are not in the Projrct.toml. As a reminder, I suggested reading the contribution guide here: https://targene.github.io/TMLE.jl/previews/PR135/contributing/. I linked a youtube video which is worth watching to understand the basics of Julia packages.

@Asantewaah
Copy link
Collaborator Author

Hi @olivierlabayle ,
I tried to mirror the syntax used in non_regression_test.jl for my LassoCTMLE test, as suggested. However, I keep getting this recurring error:

ERROR: UndefVarError: `LassoCTMLE` not defined in `Main`

I also tried calling the other strategies you created for the template much earlier (AdaptiveCorrelationStrategy, GreedyStrategy), and they’re giving the same error, even though they are exported and their files are included in TMLE.jl.
Do you have any idea why these strategies aren’t visible, or what I might be missing?
Thanks

You need to push a test file so that I can see what you are doing, from your branch, this works perfectly fine:

image The reason why your strategy is not visible is because you haven't included the source file in the package. If you include the file you will likely have an error when trying to import the package because you import other packages that are not in the Projrct.toml. As a reminder, I suggested reading the contribution guide here: https://targene.github.io/TMLE.jl/previews/PR135/contributing/. I linked a youtube video which is worth watching to understand the basics of Julia packages.

I did watch the video, really detailed but I don't understand why (AdaptiveCorrelationStrategy,and GreedyStrategy) is not registed when I call them even though I did add those. before I push the test files, I try to run it locally first so I don't mess up anything. I just commited it.

@olivierlabayle
Copy link
Member

There were quite a few problems:

  • You must not reference the parent package from within it (e.g. using ..TMLE). The functionality you are creating lives within TMLE.
  • You have made breaking changes to the dependencies (which are hopefully solved now). Regarding dependencies, the general rule is to use a few as possible in order to keep the package as lightweight as possible. In this case you will need GLMNet, so keep it for now, and we can discuss later how to make this an extension instead of having it as a true dependency. However, do you really need StatsFun?

As you can see the package is broken because GLMNet and MLJBase both define the predict function. I suggest to use import GLMNet instead of using GLMNet to solve this for now. Then you will have to explicitely prepend all GLMNet function calls by GLMNet..

@olivierlabayle
Copy link
Member

I have added some documentation in order to setup your local environment: https://targene.github.io/TMLE.jl/previews/PR135/contributing/#Environment

@Asantewaah
Copy link
Collaborator Author

Hi @olivierlabayle ,

I’ve resolved my environment issues and some other syntax inconsistencies with my code, but now the LassoCTMLE test fails with:

MethodError: no method matching glmnet!(::Matrix{Float64}, ::Vector{Float64}, ::Binomial{Float64}; lambda::Vector{Float64})

I’ve tried both glmnet and glmnet! but still get the same error about argument types.
Pushing my code now for you to see.

Thanks.

Copy link
Member

@olivierlabayle olivierlabayle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Asantewaa, looks like you are in a good position to make some progress on testing your implementation now!

)
)
lasso_result, _ = lasso_estimator(Ψ, dataset; verbosity = 0)
@test !isnan(estimate(lasso_result))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you notice any improvement using this estimator e.g., coverage ? bias ? variance ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still haven't passed the test; there seems to be an issue with how the data is being passed to the propensity score. I will take the package (Toeplitz) out of the main environment and add it to the test environment instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reflect this you need to include the test/counterfactual_mean_based/lasso_strategy.jl in the test/runtest.jl file

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve just done that, I’m still getting the same error which is basically a call on the wrong use of the propensity_score function you defined in the src/counterfactual_mean_based/collaborative_template.jl, I’m not sure what exactly I’m doing wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the tests say that you need to install : LinearAlgebra first. You need to add it to the test dependencies again. I can't see the error otherwise but I am pretty sure you are not calling the propensity_score method with an appropriate object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell what is going on locally, perhaps restart Julia (if you haven't done so already) and make sure your installation is clean.
Regarding the CI problem, one thing that is directly obvious is that the treatmnet should be categorical as per this page: https://targene.github.io/TMLE.jl/stable/walk_through/#The-Dataset. So you should call categorical(A) in your simulation dataframe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @olivierlabayle , so I hace done that but I'm still getting that error and also, there seem to be a problem with the propensity score that says : Got exception outside of a @test
Could not fit the following propensity score model: P₀(A | W1, W10, W2, W3, W4, W5, W6, W7, W8, W9).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to develop an understanding of the objects you are manipulating to solve your problems, for instance this cannot be correct: ĝ = ConditionalDistribution(g_fit, strategy.confounders) because a ConditionalDistribution is an estimand not an estimate. I suggest you write tests for each function you develop to make sure the outputs are what you expect. This will divide the big problem into smaller problems and guarantee correctness across the codebase. For example, crossvalidate_lambda, iterate for the LassoCTMLEIterator are functions that can and must be tested. You will realise there is likely quite a few problems in your code. The current tests for other strategies can be helpful: https://github.com/TARGENE/TMLE.jl/blob/main/test/counterfactual_mean_based/covariate_based_strategies.jl.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Honestly, this entire integration has been a little overwhelming especially coming into the same errors so many times, but I will break everything down and try to get them right before putting them together. Also, is it best to still call the propensity_score function you wrote or I write my own version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is for you to integrate as much as possible with the existing codebase so any function you can reuse you should. However the codebase is becoming quite large and I appreciate it might be difficult to get it correct right away. Start by coming up with something that works and then we can see together how to integrate it better within the existing code, step by step.

@Asantewaah
Copy link
Collaborator Author

Hi @olivierlabayle ,

I think the integration is working fine now. The code is a little messy, but it runs and all the tests pass. I also added an example in the examples folder, where I created a function for the Toeplitz simulation and compared its estimates to TMLE. The only thing I ran into was with the lambda values for the lasso regression, the automatically(using cross validation) generated ones weren’t being applied properly during the updating step, which made the results come out the same as TMLE. To work around this, I set up a way to manually specify lambda values, and for now there’s a mix of manual and automatic comparison logic (you’ll see that in the tests).

If you could take a look when you have a chance, that’d be great.

@Asantewaah
Copy link
Collaborator Author

Hi @olivierlabayle

So I've got a bootstrap analysis working with some nice visualizations. The results look good.

I did ran into was some dependency conflicts with ToeplitzMatrices.jl - it kept causing issues, so I just wrote a simple create_toeplitz_matrix() function from scratch instead.

I am getting some package extension warnings during the environment switching (stuff like IntervalSets, Zygote extensions complaining), but everything actually works.

I also made sure the verbose control system works well, so by default everything runs silently for production use, but you can flip verbose=true if you want to see all the variable selection details during development(the generated lambda, the glmnet results, the selected confounders......).


Copy link
Member

@olivierlabayle olivierlabayle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Asantewaa,

Thanks for meeting today and good effort with the PR! I've left a few comments on the code as discussed today. Here are the more general notes I took as I was reviewing for completeness.

The R package for reference: https://github.com/jucheng1992/ctmle/tree/master

Implementation

  1. Initial estimator:
    1. Currently this is done with the user provided G model which could be anything —> problem, the lasso CTMLE must work only with a G model that is a glmnet (see point below)
    2. I think the initial G should be fitted with glmnetcv —> need to create a GLMNet MLJ compliant model, take inspiration / use the one in TMLECLI: https://github.com/TARGENE/TMLECLI.jl/blob/main/src/models/glmnet.jl
  2. Create a sequence of estimators: the sequence of lambdas in the previous step defines the sequence
  3. Evaluate on cv:
    1. I believe StepKPropensityScoreIterator should iterate only once and return a “fixed estimator” that returns the precomputed linear model from glmnetcv. The propensity score is the original propensity score.
    2. Exhausted returns true at the end of the sequence or if patience is reached
    3. This yields a lambda CTMLE and associated GLM / Qstar
  4. There seems to be a termination step which I can’t see in the current implementation. I am also not entirely clear what is the logic I’ve seen in the paper.

Tests

  • Tests are too sparse, we need to test intermediate functions and logic
  • How does it work with categorical treatments? What about multiple treatments?

Example/Docs

  • The current comparison is with a standard estimator only using linear models (no cross validation). In order to disentangle the effect of model specification from the effect of C-TMLE, you need to provide a GLMNet to the standard estimator’s G model.
  • Only code and documentation goes into the repo, hence temporary plots you make locally should not be committed to the repo. If you want to share some results with me you can add them to the PR or send them over Teams.
  • In order to later integrate within the docs you don’t need to do using Pkg and Pkg.activate. The docs environment is used in the docs and the TMLE packaged is added to it with dev. Look at the other examples for how to make your own.

using Test
using TMLE

TEST_DIR = joinpath(pkgdir(TMLE), "test")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert the changes to this file. The joinpath is required to make sure the tests runs regardless of the run directory.

```
"""
mutable struct LassoCTMLE <: CollaborativeStrategy
confounders::Vector{Symbol}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the strategy require confounders? Should these not simply be the estimand's confounders?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this is old, it probably needs to be removed from the repo

using AutoHashEquals
using StatisticalMeasures
using DataFrames
import GLMNet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will eventually need to make GLMNet a package extension. That is we only want to load the LassoCTMLE code when the use loads GLMNet and not have GLMnet as a direct TMLE.jl dependency. Do you think you could do that?

Some docs that can help:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can give it a try after dealing with the other comments and revisions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make this example compatible with and included in the docs? This is where you can add an example file to the docs. The file must respect the Literate.jl format, which is a plain script. Use comments to drive the narative of what the example is and what it shows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are welcome to use the other examples as a source of inspiration to build yours.

lambda_path::Vector{Float64}
cv_folds::Int
alpha::Float64
verbose::Bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed last time the verbosity level is not decided by the LassoCTMLE.


function LassoCTMLE(;
patience = 5,
lambda_path = :cv,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are defined by the G glmnetcv procedure.

Extract a vector of confounder symbols from the estimand `Ψ`.
Collects treatment-specific confounders (in order) and returns unique symbols.
"""
function extract_confounders_from_estimand(Ψ)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely not needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, we will need more tests to convince ourselves and others that the code is doing what it is intended to do. As a guiding principle, think that almost all the functions you write should be tested if possible. For the statistical validity of the estimator, the documented example you have can suffice as it can be considered an end to end test.

end

selected_vars = var_names[selected_indices]
log_info(strategy, "GLMNet: α=$alpha, λ=$lambda$(length(selected_vars))/$(length(var_names)) variables selected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is excessible logging in this PR which makes it difficult to navigate. Print statements that you use for your own debugging should be removed when you comit to keep the code as simple as possible.

return unique(all)
end

function initialise!(strategy::LassoCTMLE, Ψ)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

half of this function definition are logs. Remove.

@Asantewaah
Copy link
Collaborator Author

Hi @olivierlabayle ,

I've implemented your suggestion to avoid refitting, using GLMNet CV's optimal lambda directly without iterating through multiple candidates.

I had to add two small helpers (PrefitGLMNetConditionalDistribution in estimates.jl and PrefitGLMNetJointConditionalDistributionEstimator in estimators.jl) to store and reuse the CV coefficients without refitting, I hope that's alright?

Oh and I've cleaned up the LassoCTMLE struct by removing all the collaborative iteration stuff (patience, lambda_path, etc).

The tests all pass.

Still working on adapting the example file for the docs and I'm yet to look into making GLMNet a package extension, it's a lot to go through but I created the GLMNetExt.jl file[it's empty] just to start

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants