This repository contains an R script (pca.r) that performs Principal Component Analysis (PCA) on datasets with:
- Reference samples (ref) → used as a baseline group
- Unknown samples (unknown) → compared against the baseline
The script:
- Dynamically detects numeric analyte columns (names can change between runs)
- Handles missing values in reference samples by imputing the mean of the reference group for that analyte
- Optionally handles missing values in unknown samples
- Computes PCA scores and loadings
- Ranks reference samples by distance to each unknown in PCA space
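The column detection and reference-mean imputation described above can be sketched roughly as follows. This is a minimal illustration rather than the code in pca.r: the toy data frame, the variable names (dat, analyte_cols, is_ref) and the values are all assumptions made for the example.

```r
# Toy data (made-up values); lowercase headers, as the script is said to produce.
dat <- data.frame(
  sample = c("REF001", "REF002", "REF003", "UNK001"),
  type   = c("ref", "ref", "ref", "unknown"),
  ba     = c(0.5, 0.4, 0.6, 0.6),
  cd     = c(1.2, NA, 1.4, 0.9),
  stringsAsFactors = FALSE
)

# Treat every numeric column other than the ID columns as an analyte,
# so the analyte names do not need to be hard-coded.
analyte_cols <- setdiff(names(dat)[sapply(dat, is.numeric)], c("sample", "type"))

# Replace missing reference values with the reference-group mean for that analyte.
is_ref <- dat$type == "ref"
for (col in analyte_cols) {
  ref_mean    <- mean(dat[is_ref, col], na.rm = TRUE)
  missing_ref <- is_ref & is.na(dat[[col]])
  dat[missing_ref, col] <- ref_mean
}
```

Only rows with type "ref" are touched here; how the script treats missing values in unknown samples depends on its optional setting and is not shown.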
Outputs:
- pca_plots.pdf – multi‑page PDF:
  - PCA score plot (refs = grey points, unknowns = red points with labels)
  - Scree plot (variance explained per PC, in numeric PC order)
  - PCA loadings plot (PC1 vs PC2)
- unknown_<SampleID>.csv – per‑unknown CSV containing:
  - The unknown sample’s analyte values
  - The N closest reference samples (20 by default)
  - Sorted ascending by Euclidean distance in PCA space
PCA transforms the original set of correlated variables (analytes) into a set of uncorrelated variables called principal components (PCs):
- PC1 explains the largest variance in the dataset
- PC2 explains the next largest variance, and so on
- Data is centered and scaled before PCA
Mathematically:
Z = (X − X̄) / σ
PCA: Z · W = T
Where:
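- X = original analyte data (samples in rows, analytes in columns)
- X̄, σ = per‑analyte mean and standard deviation
- Z = centered and scaled data
- W = eigenvectors (loadings)
- T = scores (coordinates in PC space)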
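As a quick check of the relationship Z · W = T, the sketch below standardizes a random matrix by hand and compares the result with what prcomp() returns. The data are random and the variable names (X, Z, W, T_scores) are chosen only to mirror the symbols above; nothing here is taken from pca.r.

```r
set.seed(1)

# Random stand-in for analyte data: 10 samples, 4 hypothetical analytes.
X <- matrix(rnorm(40), nrow = 10, ncol = 4,
            dimnames = list(NULL, c("a1", "a2", "a3", "a4")))

# Z = (X - mean) / sd, computed per column.
Z <- scale(X, center = TRUE, scale = TRUE)

# PCA on the same data with centering and scaling.
pca      <- prcomp(X, center = TRUE, scale. = TRUE)
W        <- pca$rotation   # loadings (eigenvectors)
T_scores <- pca$x          # scores

# Largest absolute difference between Z %*% W and the scores.
max_diff <- max(abs(Z %*% W - T_scores))

# Proportion of variance explained by each PC.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Here max_diff should be zero up to floating-point error, and var_explained is the quantity usually shown in a scree plot.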
Distances are computed as Euclidean distance in PCA score space:
d(a, b) = √[ Σᵢ₌₁ᵏ (PCᵢ,ₐ − PCᵢ,ᵦ)² ]
where k is the number of PCs considered (all by default).
For each unknown:
- Distances to all refs are calculated
- The closest refs are ranked and saved
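The ranking step can be illustrated with a small hand-made score matrix. This is a sketch under assumptions: the scores, the sample IDs and the names n_closest, ref_scores and ranked are invented for the example, and two PCs are used so the distances are easy to verify by hand.

```r
# Hypothetical PCA scores for three refs and one unknown (two PCs).
scores <- rbind(
  REF001 = c(0, 0),
  REF002 = c(3, 4),
  REF003 = c(1, 0),
  UNK001 = c(0, 1)
)
colnames(scores) <- c("PC1", "PC2")
type <- c("ref", "ref", "ref", "unknown")

n_closest  <- 2                                   # the README's default is 20
ref_scores <- scores[type == "ref", , drop = FALSE]
unk        <- scores["UNK001", ]

# Euclidean distance from the unknown to every reference sample.
d <- sqrt(rowSums(sweep(ref_scores, 2, unk)^2))

# Keep the n_closest references, nearest first.
ranked <- sort(d)[seq_len(min(n_closest, length(d)))]
```

With these numbers, REF001 (distance 1) ranks first and REF003 (distance √2) second, while REF002 is dropped.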
PCA loadings indicate how strongly each analyte influences a PC:
- Large magnitude = greater influence on that PC
- Sign indicates the direction of the relationship
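One way to read loadings in practice is to sort them by absolute value for a given PC. The sketch below uses R's built-in USArrests dataset purely as a stand-in for analyte data; pc1 and pc1_ranked are illustrative names, not objects from pca.r.

```r
# Built-in dataset used only as a stand-in for analyte data.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings of every variable on PC1.
pc1 <- pca$rotation[, "PC1"]

# Variables ordered from most to least influential on PC1;
# the sign of each loading is kept for interpretation.
pc1_ranked <- pc1[order(abs(pc1), decreasing = TRUE)]
```

Because each loading vector has unit length, the squared loadings on a PC sum to 1, which makes magnitudes comparable across variables.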
- Data Input
  - Reads pca_data.csv from the script’s folder
  - Converts headers to lowercase
  - Confirms sample and type columns exist
- Pre‑processing
  - Detects analyte columns automatically
  - Imputes missing reference analyte values with reference means (optional: also imputes unknown values if enabled in the script)
- PCA Calculation
  - Uses prcomp() with scaling and centering
  - Extracts scores and loadings
- Output Generation
  - Per unknown: saves unknown_<SampleID>.csv with that unknown + N closest refs (default 20)
  - Generates PDF with:
    - Score plot (refs = grey points, unknowns = red points with labels)
    - Scree plot (% variance explained)
    - Loadings plot (PC1 vs PC2)
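For orientation, a multi-page PDF with the three plot types listed above could be produced with base graphics roughly as follows. This is a sketch, not the plotting code in pca.r: it uses the built-in USArrests dataset as stand-in data, writes to a temporary file with a hypothetical name, and does not distinguish refs from unknowns.

```r
# Stand-in data; the real script works on pca_data.csv.
pca     <- prcomp(USArrests, center = TRUE, scale. = TRUE)
var_pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Hypothetical output path in the session's temporary directory.
out_file <- file.path(tempdir(), "pca_plots_demo.pdf")

pdf(out_file)  # each high-level plot call below starts a new page

# Page 1: score plot (PC1 vs PC2).
plot(pca$x[, 1], pca$x[, 2], pch = 19, col = "grey50",
     xlab = "PC1", ylab = "PC2", main = "PCA scores")

# Page 2: scree plot, bars in numeric PC order.
barplot(var_pct, names.arg = paste0("PC", seq_along(var_pct)),
        ylab = "% variance explained", main = "Scree plot")

# Page 3: loadings plot with variable names as labels.
plot(pca$rotation[, 1], pca$rotation[, 2], type = "n",
     xlab = "PC1", ylab = "PC2", main = "PCA loadings")
text(pca$rotation[, 1], pca$rotation[, 2], labels = rownames(pca$rotation))

invisible(dev.off())
```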
Save your dataset as pca_data.csv in the same directory as pca.r.
Required columns:
- Sample → unique sample ID
- Type → "ref" or "unknown"
- One or more numeric analyte columns
Example:
Sample,Type,As,Ba,Cd,Co,Cr
REF001,ref,1.2,0.5,0.3,0.6,0.8
REF002,ref,1.3,0.4,0.4,0.5,0.7
UNK001,unknown,0.9,0.6,0.5,0.7,0.9
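Before running the script, the input can be sanity-checked along these lines. The sketch parses the example rows above from an inline string instead of a file; the checks and the names csv_text, missing_cols and bad_types are illustrative assumptions, not the script's actual validation code.

```r
# The example rows from above, inlined so the snippet is self-contained.
csv_text <- "Sample,Type,As,Ba,Cd,Co,Cr
REF001,ref,1.2,0.5,0.3,0.6,0.8
REF002,ref,1.3,0.4,0.4,0.5,0.7
UNK001,unknown,0.9,0.6,0.5,0.7,0.9"

dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Lowercase the headers, as the script is described as doing.
names(dat) <- tolower(names(dat))

# Stop early if a required column is absent.
missing_cols <- setdiff(c("sample", "type"), names(dat))
if (length(missing_cols) > 0) {
  stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
}

# Flag any type values other than the two expected ones.
bad_types <- setdiff(unique(dat$type), c("ref", "unknown"))

# Sample IDs are expected to be unique.
ids_unique <- !any(duplicated(dat$sample))
```

For a real file, replacing the text argument with a path such as read.csv("pca_data.csv") would apply the same checks.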
- Rscript pca.r
- pca_plots.pdf → open to see PCA score plot, scree plot, loadings plot
- unknown_<SampleID>.csv → inspect analyte values and ranked closest refs for each unknown