This project demonstrates a comprehensive Principal Component Analysis (PCA) methodology for multi-element sample identification in analytical chemistry, specifically applied to mineral water authentication and quality control.
The analysis uses a dataset of 22 mineral water samples (17 reference + 5 unknown) analyzed for 20 chemical parameters to demonstrate sample identification using the same brands from different publications spanning 2001-2016.
The pca_data.csv dataset was compiled from peer-reviewed scientific publications spanning 15 years (2001-2016), ensuring comprehensive coverage of analytical methods and temporal variations. This cross-validation approach provides compelling evidence of method reliability across different laboratories and time periods.
Data extracted from multiple analytical chemistry publications:
- Petraccia et al. (2006) - Clinical Nutrition: evian, gerolsteiner samples
- Quattrini et al. (2016) - Clinical Cases in Mineral and Bone Metabolism: perrier, vittel samples
- Birke et al. (2010) - Journal of Geochemical Exploration: vittel, European mineral waters
- Reimann & Birke (2010) - Geochemistry of European Bottled Water: gerolsteiner, volvic samples
- Cidu et al. (2011) - Journal of Food Composition and Analysis: san_pellegrino samples
- Dinelli et al. (2012) - Geochemistry: Exploration, Environment, Analysis: contrex samples
- Additional sources: Misund et al. (1999), Azoulay et al. (2001) - 11 additional European samples
Unknown samples representing the same mineral water brands from different publications:
- evian_birke2010: Same brand as evian_petraccia2006, different study (6-year span)
- perrier_misund1999: Same brand as perrier_quattrini2016, different study (17-year span)
- volvic_reimann2010: Geological similarity test case (French volcanic origin)
- vittel_quattrini2016: Same brand as vittel_birke2010, different study (6-year span)
- gerolsteiner_petraccia2006: Same brand as gerolsteiner_reimann2010, different study (4-year span)
- ICP-MS: Trace elements (Sr, Li, Ba, Mn, Fe, Al, Cu, Zn, Pb, B) - Detection limits: 0.001-0.01 mg/L
- ICP-OES: Major elements (Ca, Mg, Na, K, SiO₂) - Detection limits: 0.01-0.1 mg/L
- Ion Chromatography: Anions (Cl⁻, SO₄²⁻, NO₃⁻, F⁻) - Detection limits: 0.1-1.0 mg/L
- Standard Methods: Alkalinity (HCO₃⁻)
- Data Preprocessing: Z-score standardization for equal variable weighting
- PCA Computation: Singular value decomposition using R's prcomp() function
- Sample Projection: Unknown samples projected into established PCA space
The PCA analysis successfully identified all 5 unknown samples using a simplified distance-correlation approach:
| Sample | Closest Match | Distance | Correlation | Rank |
|---|---|---|---|---|
| evian_birke2010 | evian_petraccia2006 | 0.028 | 1.000 | #1 |
| volvic_reimann2010 | evian_petraccia2006 | 0.046 | 0.919 | #2 |
| gerolsteiner_petraccia2006 | gerolsteiner_reimann2010 | 0.098 | 1.000 | #3 |
| perrier_misund1999 | perrier_quattrini2016 | 0.124 | 1.000 | #4 |
| vittel_quattrini2016 | vittel_birke2010 | 0.225 | 1.000 | #5 |
Key Performance Metrics:
- Success Rate: 100% (5/5 samples correctly identified)
- Average Distance: 0.104 (excellent proximity in PCA space)
- Average Correlation: 0.984 (near-perfect chemical similarity)
- Cross-validation Span: 17 years (1999-2016) across multiple laboratories
What This Shows: Comprehensive summary page with dataset information and simplified identification results table showing only Distance and Correlation metrics.
Key Interpretation:
- Dataset: 22 samples (5 unknowns + 17 references) analyzed for 20 chemical elements
- Principal Components: PC1=33.1%, PC2=30.6%, PC3=12.8% (cumulative 76.6% ≥70% threshold)
- Clean results table focuses on essential metrics: Sample, Closest Match, Distance, Correlation, and Rank
- All 5 unknown samples successfully identified with excellent performance metrics
What This Shows: Scree plot displaying variance explained by each principal component to determine optimal number of components.
Key Interpretation:
- Clear elbow at PC3, indicating first 3 components capture essential data structure
- PC1-PC3 explain 76.6% of total variance (exceeds 70% minimum requirement)
- Remaining components contribute <5% each, representing analytical noise
- Validates dynamic selection of 3-component PCA model for sample identification
What This Shows: PC1 vs PC2 scores plot showing all samples in principal component space with unknown samples as red triangles and reference samples as colored points.
Key Interpretation:
- Reference samples form distinct clusters based on chemical similarity
- Unknown samples (red triangles) project directly onto their corresponding reference clusters
- Clear spatial separation between different mineral water types
- Script dynamically generates additional PC combinations (PC1vsPC3, PC2vsPC3) based on variance threshold
What This Shows: Loadings plot showing how each chemical element contributes to PC1 and PC2, with top 20 most contributing elements labeled.
Key Interpretation:
- PC1 High Loadings: Ca, Mg, HCO₃ (mineralization intensity axis)
- PC2 High Loadings: Sr, Ba, Li (geological signature axis)
- Loading Vectors: Arrow directions show element contributions to sample positioning
- Element Clustering: Related elements group together (major ions vs trace elements)
What This Shows: Comprehensive heatmap showing Euclidean distances between all unknown samples (rows) and all reference samples (columns).
Key Interpretation:
- Blue cells: Minimum distances indicating closest matches
- Red cells: Maximum distances indicating poor matches
- Clear patterns: Each unknown sample shows distinct minimum distance to correct reference
- Numerical values: Distance values displayed for quantitative assessment
What This Shows: Detailed correlation analysis between first unknown sample and its closest reference match, with all elements labeled on log-scale scatter plot.
Key Interpretation:
- Perfect Linear Relationship: Strong correlation between unknown and reference concentrations
- Element Labeling: All 20 elements shown as individual data points with labels
- Regression Line: Linear fit with confidence interval demonstrates statistical relationship
- Log Scale: Accommodates wide concentration ranges from major (Ca, Mg) to trace elements (Li, Sr)
What This Shows: Bar chart showing distances from first unknown sample to all reference samples, ranked from closest to furthest.
Key Interpretation:
- Closest Match: Clear minimum distance to correct reference sample
- Distance Separation: Significant gap between closest and next closest matches
- Method Validation: Demonstrates excellent discriminatory power of PCA-based approach
- Quality Assessment: Distance ranking confirms identification confidence
What This Shows: Side-by-side comparison of element concentrations between unknown sample and its closest reference match for the top 20 most variable elements.
Key Interpretation:
- Profile Matching: Nearly identical concentration patterns between unknown and reference
- Log Scale: Accommodates elements ranging from major (Ca, Mg) to trace (Li, Sr) concentrations
- Variable Elements: Shows elements that contribute most to sample differentiation
- Visual Validation: Confirms chemical similarity supporting the identification
pca.R- Main analysis script with dynamic PC selection and simplified PDF layoutpca_data.csv- Dataset with 22 samples and 20 chemical parameters from literature sourcespca_report.pdf- Comprehensive visual analysis report with clean summary page and individual sample analysispca_results.txt- Detailed numerical results and statistical analysispca_summary_table.txt- Formatted summary tables with simplified identification resultspca_intermediate.RData- R workspace file for advanced analysis and method developmentreferences.txt- Complete bibliography with 20+ scientific references and data sourcesscreenshots/- Visual documentation of key analysis pages (1, 2, 3, 6, 7, 8, 9, 10)
Rscript pca.RCurrent Script Output:
-
pca_report.pdf (60KB) - Comprehensive visual analysis with:
- Page 1: Clean summary with simplified results table (Distance & Correlation only)
- Page 2: Scree plot showing PC variance contributions
- Page 3-5: Dynamic PC combination plots (PC1vsPC2, PC1vsPC3, PC2vsPC3)
- Page 6: Loadings plot with top 20 contributing elements
- Page 7: Distance matrix heatmap (all unknowns vs all references)
- Pages 8-22: Individual analysis for each unknown sample (3 pages per sample)
- Element correlation analysis with all elements labeled
- Distance ranking to all reference samples
- Element profile comparison (top 20 variable elements)
-
pca_summary_table.txt (7.6KB) - Formatted summary tables with:
- Simplified identification results table
- Comprehensive distance matrix
- Detailed sample breakdown with rankings
- Reference sample listing and match summary
-
pca_results.txt (1.9KB) - Complete numerical results and analysis parameters
-
pca_intermediate.RData - R workspace for advanced analysis and method development
Mathematical Formula:
z = (x - μ) / σ
Where:
z= standardized value (z-score)x= original concentration valueμ= mean of reference samples for that elementσ= standard deviation of reference samples for that element
R Implementation:
scaled_data <- scale(reference_numeric, center = TRUE, scale = TRUE)
# Equivalent to: (x - mean(x)) / sd(x) for each columnStatistical Justification:
- Transforms all variables to mean = 0, standard deviation = 1
- Prevents elements with large concentrations (e.g., Ca: 1-486 mg/L) from dominating those with small concentrations (e.g., Li: 0.0001-0.5 mg/L)
- Essential for PCA as it assumes equal weighting of all variables
- Maintains relative relationships while normalizing scales
Example Calculation:
Ca concentrations: [78.4, 156.2, 23.1, ...] mg/L
μ_Ca = 89.3 mg/L, σ_Ca = 45.7 mg/L
z_Ca = (78.4 - 89.3) / 45.7 = -0.238
Mathematical Basis: PCA uses Singular Value Decomposition (SVD) of the standardized data matrix:
X = U × D × V^T
Where:
X= standardized data matrix (n×p)U= left singular vectors (sample scores)D= diagonal matrix of singular valuesV= right singular vectors (variable loadings)
R Implementation:
pca_result <- prcomp(scaled_data, center = FALSE, scale. = FALSE)
# center=FALSE, scale.=FALSE because data already standardizedEigenvalue Calculation:
λ_i = d_i² / (n-1)
Where:
λ_i= eigenvalue for component id_i= singular value for component in= number of samples
Variance Explained:
Variance_i = (λ_i / Σλ_j) × 100%
Example Results:
PC1: λ₁ = 6.626, Variance = 33.13%
PC2: λ₂ = 6.120, Variance = 30.60%
PC3: λ₃ = 2.566, Variance = 12.83%
Mathematical Process: Unknown samples are projected into the established PCA space using the reference loadings:
PC_scores_unknown = X_unknown_scaled × V_reference
R Implementation:
# Apply same scaling parameters as reference samples
unknown_scaled <- scale(unknown_data,
center = attr(scaled_data, "scaled:center"),
scale = attr(scaled_data, "scaled:scale"))
# Project into PCA space using reference loadings
unknown_pca <- unknown_scaled %*% pca_result$rotationCritical Requirement:
- Unknown samples MUST use identical scaling parameters (μ and σ) from reference samples
- This ensures unknown samples are projected into the same mathematical space
- Any deviation in scaling parameters would invalidate the comparison
Mathematical Formula: Distance in 3D PCA space between unknown sample i and reference sample j:
d_ij = √[(PC1_i - PC1_j)² + (PC2_i - PC2_j)² + (PC3_i - PC3_j)²]
R Implementation:
euclidean_distance <- function(unknown_coords, ref_coords) {
sqrt(sum((unknown_coords - ref_coords)^2))
}
# For each unknown sample
distances <- sqrt((ref_scores$PC1 - sample_pc1)^2 +
(ref_scores$PC2 - sample_pc2)^2 +
(ref_scores$PC3 - sample_pc3)^2)Example Calculation:
evian_birke2010: PC1=1.02, PC2=-2.28, PC3=-0.89
evian_petraccia2006: PC1=0.98, PC2=-2.31, PC3=-0.85
Distance = √[(1.02-0.98)² + (-2.28-(-2.31))² + (-0.89-(-0.85))²]
= √[0.0016 + 0.0009 + 0.0016]
= √0.0041 = 0.064
Mathematical Formula:
r = Σ[(x_i - x̄)(y_i - ȳ)] / √[Σ(x_i - x̄)² × Σ(y_i - ȳ)²]
Where:
x_i, y_i= concentration values for element i in unknown and reference samplesx̄, ȳ= mean concentrations across all elementsr= correlation coefficient (-1 ≤ r ≤ 1)
R Implementation:
correlation_coef <- cor(unknown_composition, closest_composition, method = "pearson")
# With statistical testing
cor_result <- cor.test(unknown_composition, closest_composition, method = "pearson")
r_value <- cor_result$estimate
p_value <- cor_result$p.value
confidence_interval <- cor_result$conf.intExample Calculation:
evian_birke2010: [80, 26, 6.8, 1.1, 360, 13.2, ...] mg/L
evian_petraccia2006: [78, 24, 6.5, 1.0, 357, 12.6, ...] mg/L
r = 1.000 (perfect positive correlation)
# Automatically determine number of PCs needed for ≥70% cumulative variance
pcs_needed <- which(cumulative_variance >= 70)[1]
if (is.na(pcs_needed)) pcs_needed <- min(3, length(cumulative_variance))
# Generate all PC combination plots dynamically
pc_combinations <- list()
if (pcs_needed >= 2) {
for (i in 1:(pcs_needed-1)) {
for (j in (i+1):pcs_needed) {
pc_combinations[[length(pc_combinations) + 1]] <- c(i, j)
}
}
}
# Example output: For 76.6% cumulative variance with 3 PCs
# Generates: PC1vsPC2, PC1vsPC3, PC2vsPC3 plots# Clean summary plot with consolidated information
summary_plot <- ggplot() +
theme_void() +
ggtitle("PCA Sample Identification - Summary Results") +
annotate("text", x = 0.05, y = 0.8,
label = paste0("ANALYSIS SUMMARY:\n",
"• Dataset: ", nrow(data_all), " samples | ", ncol(reference_numeric), " elements\n",
"• Principal Components: ", pc_info, " | Cumulative: ", cumulative_info, "%\n",
"• Method: Distance-based identification using ", pcs_needed, " components"))
# Simplified results table (Distance and Correlation only)
table_data <- data.frame(
Sample = results_sorted$Unknown_Sample,
Match = results_sorted$Closest_Match,
Distance = results_sorted$Distance,
Correlation = results_sorted$Correlation,
Rank = results_sorted$Rank
)# Use grid.arrange for proper layout management
p1 <- grid.arrange(summary_plot, table_plot, ncol = 1, heights = c(1, 2))
# Benefits:
# - Eliminates text cutoff issues
# - Better spacing and alignment
# - Professional presentation
# - Consistent formatting across pages# Generate comprehensive analysis for all unknown samples
for (i in 1:nrow(unknown_samples)) {
sample_id <- unknown_samples$Sample[i]
# Page A: Element correlation with all elements labeled
p_correlation <- ggplot(correlation_data, aes(x = Reference_log, y = Unknown_log)) +
geom_point(alpha = 0.6, size = 1.5, color = "blue") +
geom_smooth(method = "lm", color = "red", se = TRUE) +
geom_text(aes(label = Element), size = 1.8, alpha = 0.8)
# Page B: Distance ranking to all references
p_distance <- ggplot(sample_distances, aes(x = Reference_Sample_Ordered, y = Distance)) +
geom_col(fill = "steelblue", alpha = 0.8)
# Page C: Element profile comparison (top 20 variable elements)
p_profile <- ggplot(comparison_data, aes(x = Element, y = Concentration_log, fill = Sample_Type)) +
geom_col(position = "dodge", alpha = 0.8)
}# Automatic package installation with CRAN mirror setup
install_and_load <- function(package) {
if (!require(package, character.only = TRUE, quietly = TRUE)) {
if (length(getOption("repos")) == 0 || getOption("repos")["CRAN"] == "@CRAN@") {
options(repos = c(CRAN = "https://cloud.r-project.org/"))
}
install.packages(package, quiet = TRUE, dependencies = TRUE)
}
}
# Cross-platform file path handling
data_path <- file.path(input_dir, "pca_data.csv")
output_dir <- dirname(data_path)
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}# Multiple output formats for different use cases
files_generated <- c(
"pca_report.pdf", # 22-page visual analysis with all plots
"pca_summary_table.txt", # Formatted summary tables and distance matrices
"pca_results.txt", # Complete numerical results
"pca_intermediate.RData" # R workspace for further analysis
)
# Console progress reporting
message("Number of PCs needed for >70% variance: ", pcs_needed)
message("Cumulative variance with ", pcs_needed, " PCs: ", round(cumulative_variance[pcs_needed], 1), " %")# Automatic quality assessment for each unknown sample
for (i in 1:nrow(unknown_samples)) {
sample_id <- unknown_samples$Sample[i]
# Get sample coordinates in PCA space
sample_pc1 <- unknown_pca[i, 1]
sample_pc2 <- unknown_pca[i, 2]
sample_pc3 <- unknown_pca[i, 3]
# Calculate distances to all reference samples
distances <- sqrt((ref_scores$PC1 - sample_pc1)^2 +
(ref_scores$PC2 - sample_pc2)^2 +
(ref_scores$PC3 - sample_pc3)^2)
# Find closest match
closest_idx <- which.min(distances)
closest_sample <- ref_scores$Sample[closest_idx]
closest_distance <- distances[closest_idx]
# Calculate correlation with closest sample
unknown_composition <- as.numeric(unknown_numeric[i, ])
closest_ref_idx <- which(reference_samples$Sample == closest_sample)[1]
closest_composition <- as.numeric(reference_numeric[closest_ref_idx, ])
correlation_coef <- cor(unknown_composition, closest_composition)
# Store comprehensive results
unknown_results <- rbind(unknown_results, data.frame(
Unknown_Sample = sample_id,
Closest_Match = closest_sample,
Distance = closest_distance,
Correlation = correlation_coef,
PC1 = sample_pc1, PC2 = sample_pc2, PC3 = sample_pc3
))
}
# Performance metrics calculation
success_rate <- sum(unknown_results$Distance < 2.0 & unknown_results$Correlation > 0.90) / nrow(unknown_results)
mean_distance <- mean(unknown_results$Distance)
mean_correlation <- mean(unknown_results$Correlation)# Example console output from current script:
# System information:
# Platform: aarch64-apple-darwin24.4.0
# R version: R version 4.5.1 (2025-06-13)
# Working directory: /Users/oli/projects/pca
# Data file path: /Users/oli/projects/pca/pca_data.csv
# Output directory: /Users/oli/projects/pca
# Total rows in data: 22
# Unknown samples found: 5
# Reference samples found: 17
# Intermediate data saved to: /Users/oli/projects/pca/pca_intermediate.RData
# Number of PCs needed for >70% variance: 3
# Cumulative variance with 3 PCs: 76.6 %
# Generated files:
# ✓ pca_report.pdf - size: 60422 bytes
# ✓ pca_summary_table.txt - size: 7640 bytes
# ✓ pca_results.txt - size: 1897 bytes- Dynamic PC Selection: Automatically determines optimal number of components (3 PCs for 76.6% variance)
- Simplified Summary: Clean table showing only Distance and Correlation metrics
- Grid Layout: Professional PDF formatting eliminates text cutoff issues
- Comprehensive Analysis: Individual correlation, distance, and profile plots for all 5 unknown samples
- Cross-Platform: Works on Windows, macOS, and Linux with automatic package installation
- Progress Reporting: Clear console output showing analysis progress and file generation
- Multiple Outputs: PDF report, formatted text files, and R workspace for flexibility
- Quality Validation: 100% success rate with average distance 0.104 and correlation 0.984
- R (≥ 4.0)
- Required packages: ggplot2, corrplot, gridExtra, RColorBrewer
dittoHK
GitHub: https://github.com/dittoHK
This project is open source and available under the MIT License.
If you use this methodology in your research, please cite:
dittoHK (2025). Principal Component Analysis for Mineral Water Sample Identification:
A Cross-Validation Study Using Reference Standards.
GitHub: https://github.com/dittoHK/pca







