Add LASSO-regularized propensity score (g-model) strategy (LassoCTMLE) #137

Asantewaah · 2025-06-19T12:44:50Z

Summary

This draft PR introduces a new collaborative strategy, LassoCTMLE, to the TMLE.jl package.
LassoCTMLE fits a LASSO-regularized logistic regression model (using GLMNet.jl) for the propensity score (g-model), with cross-validated lambda selection. The strategy is intended to be generic, supporting any parameter Ψ (not just ATE), and integrates with the existing TMLE.jl API, as advised.

Motivation for this estimator

Enables flexible, regularized estimation of the g-model, which can improve robustness in high-dimensional or collinear settings.
Helps users avoid overfitting and select an appropriate level of regularization via cross-validation.

Implementation Details

It uses cross-validation for lambda selection.

Testing

I would appreciate guidance on best practices for testing strategies in TMLE.jl.
Are there recommended datasets or paradigms in the test/ folder I should follow?
What are the edge cases or parameterizations I should cover?
Should I benchmark against existing strategies, and if so, with which datasets?

Open Questions

I currently get UndefVarError: TMLE not defined in Main when I run the lasso_strategy script on its own, just to check that everything is imported and written correctly. What's the natural way to import or use the package to test the code I write?
I'm not sure whether the current constructor/API for LassoCTMLE is idiomatic for TMLE.jl.

Next Steps

Integrate feedback from Olivier.
Add more tests and documentation as needed.

Thank you for reviewing.

Asantewaah · 2025-06-19T12:47:12Z

@olivierlabayle While trying to revert the changes VS Code made to the 56 files, I think I ended up deleting the first draft pull request, since at that point my branch looked identical to the main branch. I have started a new draft pull request for us to communicate, and I will figure out how to prevent those changes from happening again.

olivierlabayle · 2025-06-19T14:31:45Z

No problem! Do you have a test idea? That is, a simulation setting in which you expect this Lasso-CTMLE to perform better than the canonical one?

Asantewaah · 2025-06-19T14:48:40Z

Yes, I actually have a setup I have been using ever since I started building the estimator. It is basically a simulation using the Toeplitz function, so we can vary the correlation parameter. From my tests, the higher the correlation, rho, between the selection covariates, the better C-TMLE performs compared to TMLE. I will add the Julia-converted code to test/lasso_strategy.jl. I am still working out how to test anything new I add to the package.
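
A minimal sketch of the kind of Toeplitz simulation described above, using only the Julia standard library. The function names, coefficients and data-generating process are assumptions made for illustration and are not the code in the PR; the sketch assumes at least two covariates and |ρ| < 1.

```julia
using LinearAlgebra, Random

# Toeplitz (AR(1)-style) correlation matrix: Σ[i, j] = ρ^|i - j|.
toeplitz_correlation(p::Int, ρ::Real) = [ρ^abs(i - j) for i in 1:p, j in 1:p]

logistic(x) = 1 / (1 + exp(-x))

# Hypothetical data-generating process: correlated covariates W, a binary
# treatment A driven by the first two covariates, and a continuous outcome Y.
function simulate_toeplitz_data(rng::AbstractRNG, n::Int, p::Int, ρ::Real; effect = 1.0)
    Σ = toeplitz_correlation(p, ρ)
    L = cholesky(Symmetric(Σ)).L      # Σ = L * L'
    W = randn(rng, n, p) * L'         # each row is a draw from N(0, Σ)
    A = rand(rng, n) .< logistic.(0.5 .* W[:, 1] .- 0.5 .* W[:, 2])
    Y = effect .* A .+ W[:, 1] .+ 0.5 .* W[:, 2] .+ randn(rng, n)
    return (W = W, A = A, Y = Y)
end

data = simulate_toeplitz_data(MersenneTwister(42), 500, 10, 0.8)
```

Increasing ρ towards 1 makes neighbouring covariates nearly collinear, which is the regime the comment above says favours C-TMLE.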

olivierlabayle · 2025-06-19T14:58:29Z

In order to test, you simply use the estimator on your generated dataset. For the structure of the file, you can basically look at any other test file; for instance, this is an easy one to read: https://github.com/TARGENE/TMLE.jl/blob/main/test/counterfactual_mean_based/non_regression_test.jl

Asantewaah · 2025-07-02T11:14:58Z

Hi @olivierlabayle ,

I tried to mirror the syntax used in non_regression_test.jl for my LassoCTMLE test, as suggested. However, I keep getting this recurring error:

ERROR: UndefVarError: `LassoCTMLE` not defined in `Main`

I also tried calling the other strategies you created for the template much earlier (AdaptiveCorrelationStrategy, GreedyStrategy), and they’re giving the same error, even though they are exported and their files are included in TMLE.jl.

Do you have any idea why these strategies aren’t visible, or what I might be missing?

Thanks

olivierlabayle · 2025-07-02T11:29:52Z

You need to push a test file so that I can see what you are doing. From your branch, this works perfectly fine:

Your strategy is not visible because you haven't included the source file in the package. If you include the file, you will likely get an error when trying to import the package, because you import other packages that are not in the Project.toml. As a reminder, I suggested reading the contribution guide here: https://targene.github.io/TMLE.jl/previews/PR135/contributing/. I linked a YouTube video which is worth watching to understand the basics of Julia packages.

Asantewaah · 2025-07-02T12:21:17Z

I did watch the video, and it is really detailed, but I don't understand why AdaptiveCorrelationStrategy and GreedyStrategy are not recognised when I call them, even though I did add them. Before I push the test files, I try to run them locally first so I don't mess anything up. I have just committed it.

olivierlabayle · 2025-07-02T12:42:44Z

There were quite a few problems:

You must not reference the parent package from within it (e.g. using ..TMLE). The functionality you are creating lives within TMLE.
You have made breaking changes to the dependencies (which are hopefully solved now). Regarding dependencies, the general rule is to use as few as possible in order to keep the package as lightweight as possible. In this case you will need GLMNet, so keep it for now, and we can discuss later how to make this an extension instead of having it as a true dependency. However, do you really need StatsFun?

As you can see, the package is broken because GLMNet and MLJBase both define the predict function. I suggest using import GLMNet instead of using GLMNet to solve this for now. You will then have to explicitly prefix all GLMNet function calls with the module name (e.g. GLMNet.predict).
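
To illustrate the name clash without installing anything, here is a small self-contained sketch. FakeGLMNet and FakeMLJBase are hypothetical stand-in modules, not the real packages; they only mimic the fact that both export a function called predict.

```julia
# Two toy modules that both export `predict`, mimicking the clash.
module FakeGLMNet
export predict
predict(x) = "FakeGLMNet.predict($x)"
end

module FakeMLJBase
export predict
predict(x) = "FakeMLJBase.predict($x)"
end

using .FakeMLJBase    # brings the exported `predict` into scope unqualified
import .FakeGLMNet    # brings only the module name into scope

unqualified = predict(1)              # resolves to FakeMLJBase.predict
qualified = FakeGLMNet.predict(1)     # must be qualified with the module name
```

Had both modules been loaded with using, an unqualified call to predict would be ambiguous, which is why the import form is suggested above.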

olivierlabayle · 2025-07-02T13:53:34Z

I have added some documentation in order to setup your local environment: https://targene.github.io/TMLE.jl/previews/PR135/contributing/#Environment

Asantewaah · 2025-07-09T12:36:52Z

Hi @olivierlabayle ,

I’ve resolved my environment issues and some other syntax inconsistencies with my code, but now the LassoCTMLE test fails with:

MethodError: no method matching glmnet!(::Matrix{Float64}, ::Vector{Float64}, ::Binomial{Float64}; lambda::Vector{Float64})

I’ve tried both glmnet and glmnet! but still get the same error about argument types.
Pushing my code now for you to see.

Thanks.

src/TMLE.jl

test/counterfactual_mean_based/lasso_strategy_test.jl

src/counterfactual_mean_based/lasso_strategy.jl

olivierlabayle

Thanks Asantewaa, looks like you are in a good position to make some progress on testing your implementation now!

Project.toml

olivierlabayle · 2025-08-21T12:54:17Z

test/counterfactual_mean_based/lasso_strategy.jl

+        )
+    )
+    lasso_result, _ = lasso_estimator(Ψ, dataset; verbosity = 0)
+    @test !isnan(estimate(lasso_result))


Can you notice any improvement using this estimator, e.g. in coverage, bias, or variance?
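
For reference, these three quantities can be estimated by repeating a simulation many times. The sketch below uses a toy estimator (the sample mean with a normal-approximation 95% interval) purely as a placeholder; the function names are assumptions, and in practice the TMLE and Lasso-CTMLE estimates would take its place.

```julia
using Random, Statistics

# Toy estimator (an assumption for illustration): the sample mean with a
# normal-approximation 95% confidence interval.
function mean_with_ci(y::AbstractVector{<:Real})
    est = mean(y)
    se = std(y) / sqrt(length(y))
    return (estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
end

# Repeat the simulation and summarise bias, variance and interval coverage.
function monte_carlo_summary(rng::AbstractRNG, truth::Real; n = 200, nreps = 1000)
    results = [mean_with_ci(truth .+ randn(rng, n)) for _ in 1:nreps]
    estimates = [r.estimate for r in results]
    covered = [r.lower <= truth <= r.upper for r in results]
    return (bias = mean(estimates) - truth,
            variance = var(estimates),
            coverage = mean(covered))
end

mc = monte_carlo_summary(MersenneTwister(1), 2.0)
```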

I still haven't passed the test; there seems to be an issue with how the data is being passed to the propensity score. I will take the package (Toeplitz) out of the main environment and add it to the test environment instead.

To reflect this, you need to include test/counterfactual_mean_based/lasso_strategy.jl in the test/runtests.jl file.

I’ve just done that, but I’m still getting the same error, which is basically about an incorrect call to the propensity_score function you defined in src/counterfactual_mean_based/collaborative_template.jl. I’m not sure what exactly I’m doing wrong.

At the moment the tests say that you need to install LinearAlgebra first, so you need to add it to the test dependencies again. I can't see the error otherwise, but I am pretty sure you are not calling the propensity_score method with an appropriate object.

I can't tell what is going on locally, perhaps restart Julia (if you haven't done so already) and make sure your installation is clean.
Regarding the CI problem, one thing that is directly obvious is that the treatment should be categorical as per this page: https://targene.github.io/TMLE.jl/stable/walk_through/#The-Dataset. So you should call categorical(A) in your simulation dataframe.

Hi @olivierlabayle, I have done that but I'm still getting that error. There also seems to be a problem with the propensity score, which says: Got exception outside of a @test
Could not fit the following propensity score model: P₀(A | W1, W10, W2, W3, W4, W5, W6, W7, W8, W9).

You need to develop an understanding of the objects you are manipulating to solve your problems. For instance, this cannot be correct: ĝ = ConditionalDistribution(g_fit, strategy.confounders), because a ConditionalDistribution is an estimand, not an estimate. I suggest you write tests for each function you develop to make sure the outputs are what you expect. This will divide the big problem into smaller problems and guarantee correctness across the codebase. For example, crossvalidate_lambda and iterate for the LassoCTMLEIterator are functions that can and must be tested. You will realise there are likely quite a few problems in your code. The current tests for other strategies can be helpful: https://github.com/TARGENE/TMLE.jl/blob/main/test/counterfactual_mean_based/covariate_based_strategies.jl.
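
As an illustration of testing one small function in isolation, here is a sketch using the Test standard library. select_lambda is a hypothetical helper written for this example; it is not the crossvalidate_lambda function from the PR, whose signature is not shown in this thread.

```julia
using Test

# Hypothetical helper: pick the λ with the smallest mean cross-validated loss.
# `losses` is a (number of folds) × (number of λ values) matrix.
function select_lambda(lambdas::AbstractVector{<:Real}, losses::AbstractMatrix{<:Real})
    size(losses, 2) == length(lambdas) ||
        throw(ArgumentError("expected one column of losses per λ"))
    mean_losses = vec(sum(losses; dims = 1)) ./ size(losses, 1)
    return lambdas[argmin(mean_losses)]
end

@testset "select_lambda" begin
    lambdas = [1.0, 0.1, 0.01]
    losses = [0.9 0.4 0.6;
              0.8 0.5 0.7]
    # The second λ has the smallest average loss across folds.
    @test select_lambda(lambdas, losses) == 0.1
    # A mismatch between λ values and loss columns is rejected.
    @test_throws ArgumentError select_lambda(lambdas, losses[:, 1:2])
end
```

Each test pins down one expected output on a tiny, hand-checkable input, so a failure points at a single function rather than at the whole estimator.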

Alright. Honestly, this entire integration has been a little overwhelming especially coming into the same errors so many times, but I will break everything down and try to get them right before putting them together. Also, is it best to still call the propensity_score function you wrote or I write my own version?

The goal is for you to integrate as much as possible with the existing codebase so any function you can reuse you should. However the codebase is becoming quite large and I appreciate it might be difficult to get it correct right away. Start by coming up with something that works and then we can see together how to integrate it better within the existing code, step by step.

Asantewaah · 2025-10-01T18:52:52Z

Hi @olivierlabayle ,

I think the integration is working fine now. The code is a little messy, but it runs and all the tests pass. I also added an example in the examples folder, where I created a function for the Toeplitz simulation and compared its estimates to TMLE. The only issue I ran into was with the lambda values for the lasso regression: the automatically generated ones (from cross-validation) weren’t being applied properly during the updating step, which made the results come out the same as TMLE. To work around this, I set up a way to manually specify lambda values, and for now there’s a mix of manual and automatic comparison logic (you’ll see that in the tests).

If you could take a look when you have a chance, that’d be great.

Asantewaah · 2025-10-02T15:42:01Z

Hi @olivierlabayle

So I've got a bootstrap analysis working with some nice visualizations. The results look good.

I did run into some dependency conflicts with ToeplitzMatrices.jl. It kept causing issues, so I just wrote a simple create_toeplitz_matrix() function from scratch instead.

I am getting some package extension warnings during the environment switching (stuff like IntervalSets, Zygote extensions complaining), but everything actually works.

I also made sure the verbose control works well, so by default everything runs silently for production use, but you can set verbose=true if you want to see all the variable selection details during development (the generated lambda, the glmnet results, the selected confounders, etc.).

olivierlabayle

Hi Asantewaa,

Thanks for meeting today and good effort with the PR! I've left a few comments on the code as discussed today. Here are the more general notes I took as I was reviewing for completeness.

The R package for reference: https://github.com/jucheng1992/ctmle/tree/master

Implementation

Initial estimator:
1. Currently this is done with the user-provided G model, which could be anything. This is a problem: the lasso CTMLE must work only with a G model that is a glmnet (see point below).
2. I think the initial G should be fitted with glmnetcv. This requires a GLMNet MLJ-compliant model; take inspiration from, or use, the one in TMLECLI: https://github.com/TARGENE/TMLECLI.jl/blob/main/src/models/glmnet.jl
Create a sequence of estimators: the sequence of lambdas in the previous step defines the sequence
Evaluate on cv:
1. I believe StepKPropensityScoreIterator should iterate only once and return a “fixed estimator” that returns the precomputed linear model from glmnetcv. The propensity score is the original propensity score.
2. Exhausted returns true at the end of the sequence or if patience is reached
3. This yields a lambda CTMLE and associated GLM / Qstar
There seems to be a termination step which I can’t see in the current implementation. I am also not entirely clear on the logic I’ve seen in the paper.
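
The "end of the sequence or patience reached" stopping rule mentioned above can be sketched as follows. This is a hypothetical, standalone illustration operating on a plain vector of cross-validated losses (one per λ candidate); it is not the TMLE.jl iterator interface.

```julia
# Hypothetical sketch: walk a sequence of candidates (one per λ) and stop either
# at the end of the sequence or after `patience` consecutive non-improvements.
function best_candidate_with_patience(cv_losses::AbstractVector{<:Real}; patience::Int = 2)
    best_index, best_loss, since_improvement = 0, Inf, 0
    for (k, loss) in enumerate(cv_losses)
        if loss < best_loss
            best_index, best_loss, since_improvement = k, loss, 0
        else
            since_improvement += 1
        end
        since_improvement >= patience && break   # "exhausted": patience reached
    end
    return best_index                            # 0 if `cv_losses` is empty
end

best_candidate_with_patience([0.70, 0.65, 0.66, 0.67, 0.50]; patience = 2)  # returns 2
```

With patience = 2 the search stops after two consecutive non-improving candidates and never sees the last one; a larger patience would reach it and return index 5 instead.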

Tests

Tests are too sparse, we need to test intermediate functions and logic
How does it work with categorical treatments? What about multiple treatments?

Example/Docs 

The current comparison is with a standard estimator only using linear models (no cross validation). In order to disentangle the effect of model specification from the effect of C-TMLE, you need to provide a GLMNet to the standard estimator’s G model.
Only code and documentation goes into the repo, hence temporary plots you make locally should not be committed to the repo. If you want to share some results with me you can add them to the PR or send them over Teams.
In order to later integrate within the docs, you don’t need to do using Pkg and Pkg.activate. The docs environment is used in the docs and the TMLE package is added to it with dev. Look at the other examples for how to make your own.

olivierlabayle · 2025-10-06T10:25:57Z

test/runtests.jl

 using Test
 using TMLE

-TEST_DIR = joinpath(pkgdir(TMLE), "test")


Please revert the changes to this file. The joinpath is required to make sure the tests run regardless of the run directory.

olivierlabayle · 2025-10-06T10:31:10Z

src/counterfactual_mean_based/lasso_strategy.jl

+```
+"""
+mutable struct LassoCTMLE <: CollaborativeStrategy
+    confounders::Vector{Symbol}


why does the strategy require confounders? Should these not simply be the estimand's confounders?

olivierlabayle · 2025-10-27T12:57:27Z

examples/lasso_example_old.jl

If this is old, it probably needs to be removed from the repo.

olivierlabayle · 2025-10-27T13:00:43Z

src/TMLE.jl

 using AutoHashEquals
 using StatisticalMeasures
 using DataFrames
+import GLMNet


We will eventually need to make GLMNet a package extension. That is, we only want to load the LassoCTMLE code when the user loads GLMNet, and not have GLMNet as a direct TMLE.jl dependency. Do you think you could do that?

Some docs that can help:

https://pkgdocs.julialang.org/v1/creating-packages/#Conditional-loading-of-code-in-packages-(Extensions)

https://www.youtube.com/watch?v=TiIZlQhFzyk

I can give it a try after dealing with the other comments and revisions.

olivierlabayle · 2025-10-27T13:31:11Z

examples/lasso_example.jl

Could you make this example compatible with and included in the docs? This is where you can add an example file to the docs. The file must respect the Literate.jl format, which is a plain script. Use comments to drive the narrative of what the example is and what it shows.

You are welcome to use the other examples as a source of inspiration to build yours.
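
For orientation, a Literate.jl source file is an ordinary Julia script in which comment lines carry the narrative. The toy script below is only a sketch of that layout; its title and content are placeholders, not the LassoCTMLE example itself.

```julia
# # Placeholder title for the example
#
# Lines starting with `# ` are rendered as markdown by Literate.jl.
# Everything else is ordinary Julia code that runs when the docs are built.

using Statistics

# A markdown paragraph can introduce each code chunk. Here we compute the
# mean of a small vector.

x = [1.0, 2.0, 3.0]
mean_x = mean(x)

# Lines starting with `##` are kept as regular comments inside the code chunk.

## this remains a code comment in the rendered page
mean_x == 2.0
```

Because the file is a plain script, it can also be run directly with Julia to check it before it is wired into the docs.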

olivierlabayle · 2025-10-27T13:33:48Z

src/counterfactual_mean_based/lasso_strategy.jl

+    lambda_path::Vector{Float64}
+    cv_folds::Int
+    alpha::Float64
+    verbose::Bool


As discussed last time the verbosity level is not decided by the LassoCTMLE.

test/Project.toml

olivierlabayle · 2025-10-27T13:51:58Z

src/counterfactual_mean_based/lasso_strategy.jl

+
+    function LassoCTMLE(; 
+        patience = 5,
+        lambda_path = :cv,


I think these are defined by the G glmnetcv procedure.

olivierlabayle · 2025-10-27T13:52:18Z

src/counterfactual_mean_based/lasso_strategy.jl

+Extract a vector of confounder symbols from the estimand `Ψ`.
+Collects treatment-specific confounders (in order) and returns unique symbols.
+"""
+function extract_confounders_from_estimand(Ψ)


This is likely not needed

olivierlabayle · 2025-10-27T13:56:01Z

test/counterfactual_mean_based/lasso_strategy.jl

Overall, we will need more tests to convince ourselves and others that the code is doing what it is intended to do. As a guiding principle, think that almost all the functions you write should be tested if possible. For the statistical validity of the estimator, the documented example you have can suffice as it can be considered an end to end test.

olivierlabayle · 2025-10-27T13:59:09Z

src/counterfactual_mean_based/lasso_strategy.jl

+        end
+
+        selected_vars = var_names[selected_indices]
+        log_info(strategy, "GLMNet: α=$alpha, λ=$lambda → $(length(selected_vars))/$(length(var_names)) variables selected")


There is excessive logging in this PR, which makes it difficult to navigate. Print statements that you use for your own debugging should be removed when you commit, to keep the code as simple as possible.

olivierlabayle · 2025-10-27T13:59:42Z

src/counterfactual_mean_based/lasso_strategy.jl

+    return unique(all)
+end
+
+function initialise!(strategy::LassoCTMLE, Ψ)


Half of this function definition is logging. Please remove it.

Asantewaah · 2025-11-26T17:59:03Z

Hi @olivierlabayle ,

I've implemented your suggestion to avoid refitting, using GLMNet CV's optimal lambda directly without iterating through multiple candidates.

I had to add two small helpers (PrefitGLMNetConditionalDistribution in estimates.jl and PrefitGLMNetJointConditionalDistributionEstimator in estimators.jl) to store and reuse the CV coefficients without refitting. I hope that's alright.

Oh and I've cleaned up the LassoCTMLE struct by removing all the collaborative iteration stuff (patience, lambda_path, etc).

The tests all pass.

I am still working on adapting the example file for the docs, and I have yet to look into making GLMNet a package extension. It's a lot to go through, but I created the GLMNetExt.jl file (currently empty) just to get started.

add lasso_strategy

29b7782

Asantewaah requested a review from olivierlabayle June 19, 2025 12:44

data generation for test

57e1aee

test lassoCTMLE

a157e75

fix some problems

e8fddce

Asantewaah added 2 commits July 7, 2025 08:26

update packages

fb744d7

testing lasso strategy

e7441c3

olivierlabayle reviewed Aug 6, 2025

glmnet update

3293a91

olivierlabayle reviewed Aug 21, 2025

Asantewaah and others added 9 commits August 21, 2025 14:54

fix: add dataset argument

2311e9b

runtest & package update

95940ab

add LinearAlgebra[test env]

6e960a3

fix propensity_dcore

0118948

fix: ensure A is converted to Vector{Int} in simulate_highdim_lasso_data

4086aa9

fix: update simulate_highdim_lasso_data to use categorical for A and …

8ecca1f

…correct Y calculation

Refactor test file to simplify includes

ff6f894

add LASSO CTMLE example , fixed code and test

9d0d907

fix: remove package activation from lasso_strategy.jl

60a0dd3

add: bootstrap analysis

56c3d3e

olivierlabayle requested changes Oct 7, 2025

Asantewaah added 2 commits October 22, 2025 12:48

Add GLMNet MLJ wrapper and export models; keep LassoCTMLE original API

34c18d4

update user access to function atributes & add glmnet to MLJ suite

5f53d1d

olivierlabayle requested changes Oct 27, 2025

clean up and glmnet cv efficiency

11b7c7c
