DittoHK/pca

PCA Analysis Script – README

📌 Overview

This repository contains an R script (pca.r) that performs Principal Component Analysis (PCA) on datasets with:

  • Reference samples (ref) → used as a baseline group
  • Unknown samples (unknown) → compared against the baseline

The script:

  • Dynamically detects numeric analyte columns (names can change between runs)
  • Handles missing values in reference samples by imputing the mean of the reference group for that analyte
  • Optionally imputes missing values in unknown samples as well (if enabled in the script)
  • Computes PCA scores and loadings
  • Ranks reference samples by distance to each unknown in PCA space

Outputs:

  • pca_plots.pdf – multi‑page PDF:
    1. PCA score plot (refs = grey points, unknowns = red points with labels)
    2. Scree plot (variance explained per PC, in numeric PC order)
    3. PCA loadings plot (PC1 vs PC2)
  • unknown_<SampleID>.csv – per‑unknown CSV containing:
    • The unknown sample’s analyte values
    • The N closest reference samples (20 by default)
    • Sorted ascending by Euclidean distance in PCA space
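As a rough illustration of how a three-page PDF like the one described above could be produced, here is a minimal sketch in base R. It is not taken from pca.r: the built-in iris measurements stand in for analyte data, and the output file name and temporary location are assumptions made for the example.

```r
# Built-in iris measurements stand in for analyte columns (illustration only)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Percent variance explained per PC, kept in numeric PC order (PC1, PC2, ...)
var_pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)
names(var_pct) <- colnames(pca$x)

# Hypothetical output path; each high-level plot call starts a new PDF page
out_file <- file.path(tempdir(), "pca_plots_demo.pdf")
pdf(out_file)

# Page 1: score plot
plot(pca$x[, 1], pca$x[, 2], pch = 16, col = "grey50",
     xlab = "PC1", ylab = "PC2", main = "PCA scores")

# Page 2: scree plot
barplot(var_pct, ylab = "% variance explained", main = "Scree plot")

# Page 3: loadings plot, one text label per analyte
plot(pca$rotation[, 1], pca$rotation[, 2], type = "n",
     xlab = "PC1 loading", ylab = "PC2 loading", main = "PCA loadings")
text(pca$rotation[, 1], pca$rotation[, 2], labels = rownames(pca$rotation))

invisible(dev.off())
```

The percentages come from the squared standard deviations in `pca$sdev`, so they sum to 100 and never increase from one PC to the next.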

📊 Statistical Background

1. Principal Component Analysis (PCA)

PCA transforms the original set of correlated variables (analytes) into a set of uncorrelated variables called principal components (PCs):

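  • PC1 is the direction that explains the largest variance in the dataset
  • PC2 explains the largest remaining variance while being uncorrelated with PC1, and so on
  • Data are centered and scaled to unit variance before PCA, so every analyte contributes on the same scale regardless of its units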

Mathematically:

Z = (X − X̄) / σ

PCA: Z · W = T

Where:

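  • X = original analyte data (samples in rows, analytes in columns)
  • X̄, σ = per‑analyte (column) mean and standard deviation
  • Z = centered and scaled data
  • W = eigenvectors of the correlation matrix of X (loadings)
  • T = scores (coordinates in PC space)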
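The relationship Z · W = T can be checked numerically. The sketch below uses made-up toy values rather than real analyte data, and the variable names are chosen for the example only; it assumes PCA is run with prcomp() using centering and scaling, as the workflow section states.

```r
# Toy data: 6 samples x 3 analytes (random values, for illustration only)
set.seed(1)
X <- matrix(rnorm(18, mean = 5, sd = 2), nrow = 6,
            dimnames = list(paste0("S", 1:6), c("as", "ba", "cd")))

# Z = (X - column mean) / column standard deviation
Z <- scale(X, center = TRUE, scale = TRUE)

# PCA on the raw data with centering and scaling
pca <- prcomp(X, center = TRUE, scale. = TRUE)

W <- pca$rotation        # loadings: analytes x PCs
T_scores <- pca$x        # scores: samples x PCs (named to avoid masking T)

# Multiplying the standardized data by the loadings reproduces the scores
max_abs_diff <- max(abs(Z %*% W - T_scores))
print(max_abs_diff)
```

The printed difference should be zero up to floating-point rounding.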

2. Distance Metric

Distances are computed as Euclidean distance in PCA score space:

d(a, b) = √[ Σᵢ₌₁ᵏ (PCᵢ,ₐ − PCᵢ,ᵦ)² ]

where k is the number of PCs considered (all by default).

For each unknown:

  • Distances to all refs are calculated
  • The closest refs are ranked and saved
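A minimal sketch of this ranking step, assuming a score matrix with sample IDs as row names. The scores, sample IDs, and the `n_closest` variable are hypothetical and are not taken from pca.r.

```r
# Hypothetical PCA scores: rows are samples, columns are PCs
scores <- rbind(
  REF001 = c(0.0, 0.0),
  REF002 = c(3.0, 4.0),
  REF003 = c(1.0, 0.5),
  UNK001 = c(1.0, 0.0)
)
colnames(scores) <- c("PC1", "PC2")
type <- c("ref", "ref", "ref", "unknown")

ref_scores <- scores[type == "ref", , drop = FALSE]
unk <- scores["UNK001", ]

# Euclidean distance from the unknown to every reference, over all PCs
d <- sqrt(rowSums(sweep(ref_scores, 2, unk)^2))

# Keep the n closest references, sorted ascending by distance
n_closest <- 2
ranked <- sort(d)[seq_len(min(n_closest, length(d)))]
print(ranked)
```

Here REF003 (distance 0.5) ranks ahead of REF001 (distance 1), and REF002 is dropped because only two references are kept.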

3. PCA Loadings

PCA loadings indicate how strongly each analyte influences a PC:

  • Large magnitude = greater influence on that PC
  • Sign indicates the direction of the relationship
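To make this concrete, the following sketch ranks variables by the magnitude of their PC1 loading. The built-in iris measurements stand in for analyte columns; nothing here is taken from pca.r.

```r
# Built-in iris measurements stand in for analyte columns (illustration only)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
loadings <- pca$rotation

# Variables ordered by absolute loading on PC1, largest influence first
pc1_rank <- loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), "PC1"]
print(pc1_rank)
```

Each column of the loadings matrix has unit length, so loadings are comparable within a PC. The overall sign of a PC is arbitrary: flipping every loading and score on that PC describes the same component, so only the relative signs between analytes are meaningful.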

🖥 Code Workflow

  1. Data Input

    • Reads pca_data.csv from the script’s folder
    • Converts headers to lowercase
    • Confirms sample and type columns exist
  2. Pre‑processing

    • Detects analyte columns automatically
    • Imputes missing reference analyte values with reference means
      (optional: also imputes unknown values if enabled in the script)
  3. PCA Calculation

    • Uses prcomp() with scaling and centering
    • Extracts scores and loadings
  4. Output Generation

    • Per unknown: Saves unknown_<SampleID>.csv with that unknown + N closest refs (default 20)
    • Generates PDF with:
      1. Score plot (refs = grey points, unknowns = red points with labels)
      2. Scree plot (% variance explained)
      3. Loadings plot (PC1 vs PC2)
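The pre‑processing step (step 2) could look roughly like the sketch below. It assumes headers are already lowercased and that every numeric column other than sample and type is an analyte; the data values are invented, and the actual logic in pca.r may differ.

```r
# Hypothetical input with lowercased headers and some missing values
df <- data.frame(
  sample = c("REF001", "REF002", "REF003", "UNK001"),
  type   = c("ref", "ref", "ref", "unknown"),
  as     = c(1.2, NA, 1.4, 0.9),
  ba     = c(0.5, 0.4, NA, NA),
  stringsAsFactors = FALSE
)

# Detect analytes: every numeric column that is not an ID column
analytes <- setdiff(names(df)[sapply(df, is.numeric)], c("sample", "type"))

# Replace missing reference values with the reference-group mean per analyte
is_ref <- df$type == "ref"
for (a in analytes) {
  ref_mean <- mean(df[[a]][is_ref], na.rm = TRUE)
  fill <- is_ref & is.na(df[[a]])
  df[[a]][fill] <- ref_mean
}
print(df)
```

In this sketch the missing value for UNK001 is left as NA, which corresponds to running with imputation of unknowns disabled.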

⚙️ Usage Instructions

1. Prepare your data

Save your dataset as pca_data.csv in the same directory as pca.r.

Required columns:

  • Sample → unique sample ID
  • Type → "ref" or "unknown"
  • One or more numeric analyte columns

Example:

Sample,Type,As,Ba,Cd,Co,Cr
REF001,ref,1.2,0.5,0.3,0.6,0.8
REF002,ref,1.3,0.4,0.4,0.5,0.7
UNK001,unknown,0.9,0.6,0.5,0.7,0.9
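Before running the script, a file in this layout can be sanity-checked with a few lines of R. This is a sketch under the assumptions stated in this README (required sample and type columns, headers converted to lowercase); the error messages and the inline `csv_text` variable are illustrative, not part of pca.r.

```r
# The example above, inlined as text so the sketch is self-contained
csv_text <- "Sample,Type,As,Ba,Cd,Co,Cr
REF001,ref,1.2,0.5,0.3,0.6,0.8
REF002,ref,1.3,0.4,0.4,0.5,0.7
UNK001,unknown,0.9,0.6,0.5,0.7,0.9"

dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)
names(dat) <- tolower(names(dat))  # headers are matched in lowercase

# Required columns must be present
missing_cols <- setdiff(c("sample", "type"), names(dat))
if (length(missing_cols) > 0) {
  stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
}

# Type must be "ref" or "unknown", and sample IDs must be unique
bad_types <- setdiff(unique(dat$type), c("ref", "unknown"))
if (length(bad_types) > 0) {
  stop("Unexpected type value(s): ", paste(bad_types, collapse = ", "))
}
if (anyDuplicated(dat$sample) > 0) {
  stop("Sample IDs must be unique")
}

print(table(dat$type))
```

To check a real file, replace `text = csv_text` with the path to pca_data.csv.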

2. Run the script

  • Rscript pca.r

3. Check the outputs

  • pca_plots.pdf → open to view the PCA score plot, scree plot, and loadings plot

  • unknown_<SampleID>.csv → inspect the analyte values and the ranked closest refs for each unknown
