From cd20f6909838e05d0333e6c0167b246d2f8f4203 Mon Sep 17 00:00:00 2001
From: Gauarv Chaudhary
Date: Fri, 31 Oct 2025 16:18:13 +0530
Subject: [PATCH 1/2] Update multicollinearity advice in check_model vignette (#828)

- Remove blanket recommendation to drop high-VIF predictors
- Add nuanced, research-backed guidance on when multicollinearity matters
- Distinguish between prediction and interpretation goals
- Add 6 academic references (2003-2021) supporting the new approach
- Improve prose clarity and structure
---
 vignettes/check_model.Rmd | 41 +++++++++++++++++++++++++++++++--------
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/vignettes/check_model.Rmd b/vignettes/check_model.Rmd
index 3e52ef797..05f034fd7 100644
--- a/vignettes/check_model.Rmd
+++ b/vignettes/check_model.Rmd
@@ -233,24 +233,37 @@ Dealing with outliers is not straightforward, as it is not recommended to automa
 
 ## Multicollinearity
 
-This plot checks for potential collinearity among predictors. In a nutshell multicollinearity means that once you know the effect of one predictor, the value of knowing the other predictor is rather low. Multicollinearity might arise when a third, unobserved variable has a causal effect on each of the two predictors that are associated with the outcome. In such cases, the actual relationship that matters would be the association between the unobserved variable and the outcome.
+This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.
 
-Multicollinearity should not be confused with a raw strong correlation between predictors. What matters is the association between one or more predictor variables, *conditional on the other variables in the model*.
-
-If multicollinearity is a problem, the model seems to suggest that the predictors in question don't seems to be reliably associated with the outcome (low estimates, high standard errors), although these predictors actually are strongly associated with the outcome, i.e. indeed might have strong effect (_McElreath 2020, chapter 6.1_).
+Multicollinearity can arise when a third, unobserved variable causally affects multiple predictors that are associated with the outcome. When multicollinearity is present, the model may show that individual predictors do not appear reliably associated with the outcome (yielding low estimates and high standard errors), even when these predictors are actually strongly related to the outcome (_McElreath 2020, chapter 6.1_).
 
 ```{r eval=all(successfully_loaded[c("see", "ggplot2")])}
 # multicollinearity
 diagnostic_plots[[5]]
 ```
 
-The variance inflation factor (VIF) indicates the magnitude of multicollinearity of model terms. The thresholds for low, moderate and high collinearity are VIF values less than 5, between 5 and 10 and larger than 10, respectively (_James et al. 2013_). Note that these thresholds, although commonly used, are also criticized for being too high. _Zuur et al. (2010)_ suggest using lower values, e.g. a VIF of 3 or larger may already no longer be considered as "low".
+The variance inflation factor (VIF) indicates the magnitude of multicollinearity of model terms.
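+
+If you want to inspect the VIF values behind this plot directly, they can be computed with `check_collinearity()` from the *performance* package. The following is a minimal sketch on simulated data; the variables `x1`, `x2`, `y` and the model `m` are made up purely for illustration.
+
+```{r eval=FALSE}
+set.seed(123)
+x1 <- rnorm(200)
+x2 <- x1 + rnorm(200, sd = 0.2) # x2 is nearly a copy of x1
+y <- x1 + x2 + rnorm(200)
+m <- lm(y ~ x1 + x2)
+
+# VIF for each model term; x1 and x2 should show clearly elevated values
+check_collinearity(m)
+```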
+Common thresholds suggest that VIF values less than 5 indicate low collinearity, values between 5 and 10 indicate moderate collinearity, and values larger than 10 indicate high collinearity (_James et al. 2013_). However, these thresholds have been criticized for being too lenient: _Zuur et al. (2010)_ suggest stricter criteria, under which a VIF of 3 or larger may already warrant concern. That said, any such threshold should be interpreted cautiously and in context (_O'Brien 2007_).
 
-Our model clearly suffers from multicollinearity, as all predictors have high VIF values.
+Our model clearly shows multicollinearity, as all predictors have high VIF values.
 
-### How to fix this?
+### How to interpret and address this?
+
+High VIF values indicate that coefficient estimates may be unstable and have inflated standard errors. However, **removing predictors with high VIF values is generally not recommended** as a blanket solution (_Vanhove 2021; Morrissey and Ruxton 2018; Gregorich et al. 2021_). Multicollinearity is primarily a concern for the *interpretation* of individual coefficients, not for the model's overall predictive performance or for drawing inferences about the combined effects of correlated predictors.
+
+Consider these points when dealing with multicollinearity:
+
+1. **If your goal is prediction**, multicollinearity is typically not a problem. The model can still make accurate predictions even when predictors are highly correlated (_Feng et al. 2019; Graham 2003_).
+
+2. **If your goal is to interpret individual coefficients**, high VIF values signal that you should be cautious. Each coefficient represents the effect of one predictor while holding all others constant, which may not be meaningful when predictors are strongly related. In such cases, consider:
+   - Interpreting coefficients jointly rather than individually
+   - Acknowledging the uncertainty in individual coefficient estimates
+   - Considering whether your research question truly requires separating the effects of correlated predictors
+
+3. **For interaction terms**, high VIF values are expected and often unavoidable. This is sometimes called "inessential ill-conditioning" (_Francoeur 2013_). Centering the component variables can sometimes reduce VIF values for interactions (_Kim and Jung 2024_); see the sketch at the end of this section.
 
-Usually, predictors with (very) high VIF values should be removed from the model to fix multicollinearity. Some caution is needed for interaction terms. If interaction terms are included in a model, high VIF values are expected. This portion of multicollinearity among the component terms of an interaction is also called "inessential ill-conditioning", which leads to inflated VIF values that are typically seen for models with interaction terms _(Francoeur 2013)_. In such cases, try centering the involved interaction terms, which can reduce multicollinearity _(Kim and Jung 2024)_, or re-fit your model without interaction terms and check this model for collinearity among predictors.
+4. **Consider the substantive context**: sometimes multicollinearity reflects important aspects of your data or research question. Removing variables to reduce VIF may actually harm your analysis by omitting important confounders or by changing the interpretation of the remaining coefficients (_Gregorich et al. 2021_).
+
+Rather than automatically removing predictors, focus on whether multicollinearity prevents you from answering your specific research question, and whether the instability in coefficient estimates is acceptable given your goals.
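+
+To illustrate point 3 above, here is a minimal sketch on simulated data (the variables `x`, `z`, `y` and the models `m_raw` and `m_centered` are made up for illustration). It shows how centering the components of an interaction can reduce the inessential part of the collinearity, again assuming `check_collinearity()` from the *performance* package is available.
+
+```{r eval=FALSE}
+set.seed(42)
+x <- rnorm(200, mean = 5) # non-zero means make x and x:z strongly correlated
+z <- rnorm(200, mean = 3)
+y <- x + z + 0.5 * x * z + rnorm(200)
+
+m_raw <- lm(y ~ x * z)
+check_collinearity(m_raw) # typically shows high VIFs for x, z, and x:z
+
+# center the component variables, then refit the interaction model
+x_c <- x - mean(x)
+z_c <- z - mean(z)
+m_centered <- lm(y ~ x_c * z_c)
+check_collinearity(m_centered) # the VIFs should now be much closer to 1
+```
+
+Note that centering does not change the model's fit or predictions; it only changes what the lower-order coefficients mean (effects at the mean of the other variable, rather than at zero).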
 
 ## Normality of residuals
@@ -283,6 +296,8 @@ Cook RD. Detection of influential observation in linear regression. Technometric
 
 Cook RD and Weisberg S. Residuals and Influence in Regression. London: Chapman and Hall, 1982.
 
+Feng X, Park DS, Liang Y, Pandey R, Papeş M. Collinearity in Ecological Niche Modeling: Confusions and Challenges. Ecology and Evolution. 2019;9(18):10365-76. doi:10.1002/ece3.5555
+
 Francoeur RB. Could Sequential Residual Centering Resolve Low Sensitivity in Moderated Regression? Simulations and Cancer Symptom Clusters. Open Journal of Statistics. 2013:03(06), 24-44.
 
 Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, and Rubin DB. Bayesian data analysis. (Third edition). CRC Press, 2014
@@ -291,6 +306,10 @@ Gelman A, Greenland S. Are confidence intervals better termed "uncertainty inter
 
 Gelman A, and Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge; New York. Cambridge University Press, 2007
 
+Graham MH. Confronting Multicollinearity in Ecological Multiple Regression. Ecology. 2003;84(11):2809-15. doi:10.1890/02-3114
+
+Gregorich M, Strohmaier S, Dunkler D, Heinze G. Regression with Highly Correlated Predictors: Variable Omission Is Not the Solution. International Journal of Environmental Research and Public Health. 2021;18(8):4259. doi:10.3390/ijerph18084259
+
 James, G., Witten, D., Hastie, T., and Tibshirani, R. (eds.).An introduction to statistical learning: with applications in R. New York: Springer, 2013
 
 Kim, Y., & Jung, G. (2024). Understanding linear interaction analysis with causal graphs. British Journal of Mathematical and Statistical Psychology, 00, 1–14.
@@ -299,6 +318,12 @@ Leys C, Delacre M, Mora YL, Lakens D, Ley C. How to Classify, Detect, and Manage
 
 McElreath, R. Statistical rethinking: A Bayesian course with examples in R and Stan. 2nd edition. Chapman and Hall/CRC, 2020
 
+Morrissey MB, Ruxton GD. Multiple Regression Is Not Multiple Regressions: The Meaning of Multiple Regression and the Non-Problem of Collinearity. Philosophy, Theory, and Practice in Biology. 2018;10. doi:10.3998/ptpbio.16039257.0010.003
+
+O'Brien RM. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Quality & Quantity. 2007;41(5):673-90. doi:10.1007/s11135-006-9018-6
+
 Pek J, Wong O, Wong ACM. How to Address Non-normality: A Taxonomy of Approaches, Reviewed, and Illustrated. Front Psychol (2018) 9:2104. doi: 10.3389/fpsyg.2018.02104
 
+Vanhove J. Collinearity Isn't a Disease That Needs Curing. Meta-Psychology. 2021;5(April). doi:10.15626/MP.2021.2548
+
 Zuur AF, Ieno EN, Elphick CS. A protocol for data exploration to avoid common statistical problems: Data exploration. Methods in Ecology and Evolution (2010) 1:3-14.

From bb018c00e6655ca4f8822fb2b5da54ae6fa3811d Mon Sep 17 00:00:00 2001
From: Gauarv Chaudhary
Date: Fri, 31 Oct 2025 19:02:35 +0530
Subject: [PATCH 2/2] Made all the Gemini-recommended changes

---
 vignettes/check_model.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vignettes/check_model.Rmd b/vignettes/check_model.Rmd
index 05f034fd7..ec61f6c03 100644
--- a/vignettes/check_model.Rmd
+++ b/vignettes/check_model.Rmd
@@ -233,7 +233,7 @@ Dealing with outliers is not straightforward, as it is not recommended to automa
 
 ## Multicollinearity
 
-This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.
+This plot checks for potential collinearity among predictors. Multicollinearity occurs when predictor variables are highly correlated with each other, conditional on the other variables in the model. In other words, the information one predictor provides about the outcome is redundant in the presence of the other predictors. This should not be confused with simple pairwise correlation between predictors; what matters is the association between predictors *after accounting for all other variables in the model*.
 
 Multicollinearity can arise when a third, unobserved variable causally affects multiple predictors that are associated with the outcome. When multicollinearity is present, the model may show that individual predictors do not appear reliably associated with the outcome (yielding low estimates and high standard errors), even when these predictors are actually strongly related to the outcome (_McElreath 2020, chapter 6.1_).