This project analyzes a dataset of lizard species (lizard.csv) to explore how their life history traits are associated with environmental drivers, using statistical and machine learning techniques. The analysis combines Principal Component Analysis (PCA), Random Forest modeling, and unsupervised learning to uncover patterns and relationships within the data.
The repository contains R Markdown files (.Rmd) for conducting and documenting the analysis, as well as associated outputs in HTML format.
- FinalAnalysis.Rmd: The main analysis script containing data cleaning, PCA, correlation matrix, random forest modeling, and visualization code. Final output is an HTML document with results.
- lizard.csv: The primary dataset containing lizard-related data, including species traits like clutch frequency, habitat type, and body size.
- Eamon'sSketch.Rmd / Rich_Sketch.Rmd: Each team member's own working script, kept separate to avoid conflicting edits when both work at the same time.
- Eamon'sOld.Rmd: A previous version of the working script.
- MathematicalTools.Rproj: RStudio project file to organize and manage the working directory for this analysis.
- Clobert et al., 1998.pdf: A relevant reference used to guide the project.
This project applies several statistical and machine learning techniques using the R programming language:
- Data Cleaning: Removal of duplicates and handling missing values. Conversion of variables to appropriate data types for analysis.
- Principal Component Analysis (PCA): Reduces dimensionality and highlights key variables influencing species differences. Visualized using biplots and scree plots.
- Random Forest Modeling: Predicts clutch frequency based on species traits. Evaluates model accuracy and identifies important predictors.
- Correlation Analysis: Examines relationships between numeric variables to identify redundancy or strong correlations.
- Unsupervised Learning: Clustering techniques to explore hidden patterns and groupings within the data.
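
The cleaning, correlation, PCA, and clustering steps listed above could be sketched roughly as follows. This is a minimal illustration in base R on synthetic data, not the code in FinalAnalysis.Rmd. The column names, the use of `prcomp` (the project itself lists FactoMineR and factoextra), and the choice of three k-means clusters are all assumptions made for the example.

```r
# Synthetic stand-in for lizard.csv; the column names are hypothetical.
set.seed(42)
n <- 60
body_size <- rnorm(n, mean = 70, sd = 15)
traits <- data.frame(
  body_size   = body_size,
  clutch_size = 0.1 * body_size + rnorm(n, sd = 1),
  egg_mass    = rnorm(n, mean = 0.5, sd = 0.1),
  elevation   = runif(n, min = 0, max = 2000)
)
traits <- rbind(traits, traits[1:3, ])  # add three duplicate rows
traits$egg_mass[5] <- NA                # add one missing value

# Data cleaning: drop exact duplicate rows, then rows with missing values.
clean <- na.omit(unique(traits))

# Correlation matrix of the numeric traits.
cor_mat <- cor(clean)
print(round(cor_mat, 2))

# PCA on centred, unit-variance variables.
pca <- prcomp(clean, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# k-means on the first two principal components; k = 3 is an arbitrary choice.
km <- kmeans(pca$x[, 1:2], centers = 3, nstart = 25)
print(table(km$cluster))

# Plots (uncomment when running interactively):
# biplot(pca)                     # PCA biplot
# screeplot(pca, type = "lines")  # scree plot
```

With real data, the non-numeric columns (for example habitat type) would need to be dropped or encoded before calling `cor` and `prcomp`.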
- RStudio: For running R Markdown files and managing the project.
- LaTeX (optional): For generating PDF outputs from `.Rmd` files.
The following libraries are used in the analysis and must be installed:
install.packages(c("tidyverse", "corrplot", "FactoMineR", "factoextra",
                   "vegan", "ggplot2", "rsample", "rpart", "rpart.plot",
                   "randomForest", "tibble", "tidyr", "gridExtra",
                   "caret", "cluster"))
- Clone the repository and open the `MathematicalTools.Rproj` file in RStudio.
- Ensure all required packages are installed.
- Open `Analysis.Rmd`.
- Knit the file to generate an HTML or PDF report with results.
- View the output file `Analysis.nb.html` for detailed results and visualizations.
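
The package check and knitting steps above can also be run from the R console. The sketch below is one possible way to do it, not part of the repository: `missing_packages` is a hypothetical helper, the file name is taken from the steps above, and the rmarkdown package is assumed to be available (it ships with RStudio but is not in the package list).

```r
# Packages named in the install.packages() call of this README.
required <- c("tidyverse", "corrplot", "FactoMineR", "factoextra", "vegan",
              "ggplot2", "rsample", "rpart", "rpart.plot", "randomForest",
              "tibble", "tidyr", "gridExtra", "caret", "cluster")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}

# Knit from the console instead of the RStudio Knit button.
rmd_file <- "Analysis.Rmd"
if (file.exists(rmd_file) && length(to_install) == 0 &&
    requireNamespace("rmarkdown", quietly = TRUE)) {
  # With no output_format argument, render() uses the first format
  # declared in the file's YAML header.
  rmarkdown::render(rmd_file)
}
```

The render call is skipped when the file is not in the working directory or when packages are still missing, so the snippet is safe to run before setup is complete.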
- Data Cleaning: Ensures high-quality data by removing duplicates, converting variables to appropriate types, and handling missing values effectively.
- Correlation Analysis: Identifies relationships and redundancy among numeric variables, guiding feature selection and interpretation.
- Principal Component Analysis (PCA): Highlights the key variables influencing species differences and reduces dimensionality for clearer visualization and clustering.
- Clustering Analysis: Uses unsupervised learning methods to uncover hidden patterns and groupings within the data.
- Random Forest Modeling: Develops a predictive model for clutch frequency, evaluates model accuracy, and identifies the most important predictors influencing the outcome.
- Visualization: Generates informative plots, including correlation heatmaps, PCA biplots, and Random Forest variable importance charts, to present findings clearly and effectively.
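
As an illustration of the Random Forest step, the sketch below fits a forest with the randomForest package, checks accuracy on held-out rows, and prints predictor importance. It runs on synthetic data with hypothetical column names, and it assumes clutch frequency is a two-level categorical variable. If the real variable is numeric, `randomForest` would fit a regression forest instead and accuracy would not be the right metric. The split uses base `sample` rather than rsample or caret, which the project lists.

```r
library(randomForest)

# Synthetic stand-in for lizard.csv; all column names are hypothetical.
set.seed(123)
n <- 200
lizards <- data.frame(
  body_size = rnorm(n, mean = 70, sd = 15),
  elevation = runif(n, min = 0, max = 2000),
  habitat   = factor(sample(c("forest", "desert", "grassland"), n, replace = TRUE))
)
# Assumed response: a two-level clutch frequency driven mostly by body size.
lizards$clutch_freq <- factor(
  ifelse(lizards$body_size + rnorm(n, sd = 5) > 70, "multiple", "single")
)

# Hold out 25% of the rows for testing.
test_idx <- sample(n, size = 0.25 * n)
train <- lizards[-test_idx, ]
test  <- lizards[test_idx, ]

# Fit the forest; importance = TRUE also stores permutation importance.
rf <- randomForest(clutch_freq ~ ., data = train, ntree = 500, importance = TRUE)

# Accuracy on the held-out rows.
pred <- predict(rf, newdata = test)
accuracy <- mean(pred == test$clutch_freq)
print(accuracy)

# Importance of each predictor (one row per predictor).
print(importance(rf))
# varImpPlot(rf)  # variable importance chart, for interactive use
```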
- Fork the repository and create a new branch.
- Make your changes or additions (e.g., improve code, add new analyses).
- Submit a pull request with a description of your changes.
- Clobert et al., 1998: Referenced as part of the biological context for the analysis.
- Teaching and guidance from:
  - Eric Macron (R Markdown)
  - Lucia Clarotto (Mathematical Tools)
  - Researcher X (LaTeX)
  - Researcher Y (Zotero)