Principal Component Analysis (PCA)

Introduction

Principal component analysis (PCA) reduces the dimensionality of large datasets while retaining most of the original information. It does this by transforming a set of potentially correlated variables into a smaller set of uncorrelated variables, called principal components.

Let \(X\) be the \((n,p)\) data matrix and denote by \(X_c\) the centered matrix. The empirical variance-covariance matrix is \(C = \frac{1}{n}X_c^TX_c\). We denote by \(\Lambda\) the vector of eigenvalues of \(C\) (in decreasing order) and by \(U\) the \((p,p)\) orthogonal matrix of eigenvectors: \[C=U\texttt{diag}(\Lambda) U^T.\] Then we have

  • The coordinates of the \(n\) observations in the new basis of eigenvectors \((\vec{u}_1,\ldots,\vec{u}_p)\) are \[\Psi = X_cU\]
  • The coordinates of the \(p\) variables in the new basis \((\vec{v}_1,\ldots,\vec{v}_p)\) are \[\Phi = \sqrt{n}\,U\texttt{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_p})\]
  • The total inertia (variance) is \[I = \texttt{trace}(C)=\sum_{i=1}^{p}\lambda_i\]
  • The variance of the new variable \(v_i\) is \(\lambda_i\) (these relations are checked numerically in the sketch below)
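
As a quick numerical check of these formulas, here is a minimal sketch on a small synthetic matrix (not part of the examples below) that verifies the eigendecomposition of \(C\), the total inertia and the variance of the components:

Code
using LinearAlgebra, Statistics
let n = 10, p = 3
    X  = randn(n, p)                   # small synthetic data, for illustration only
    Xc = X .- mean(X, dims=1)          # centered data
    C  = (1/n) * Xc' * Xc              # empirical variance-covariance matrix
    Λ, U = eigen(Symmetric(C))
    ord  = sortperm(Λ, rev=true)       # sort the eigenvalues in decreasing order
    Λ, U = Λ[ord], U[:, ord]
    Ψ = Xc * U                         # coordinates of the observations
    @assert C ≈ U * Diagonal(Λ) * U'                  # C = U diag(Λ) Uᵀ
    @assert tr(C) ≈ sum(Λ)                            # total inertia
    @assert vec(var(Ψ, corrected=false, dims=1)) ≈ Λ  # variance of component i is λᵢ
end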

Iris Data

Code
using Plots
using Statistics, LinearAlgebra
using RDatasets, DataFrames
iris = RDatasets.dataset("datasets", "iris")  # Iris data
Names = names(iris)
X = Matrix(iris[:,1:4])
p1 = scatter(X[:,1],X[:,2], c=[:blue :red :green], group=iris.Species,xlabel = Names[1],ylabel=Names[2])
p2 = scatter(X[:,3],X[:,4], c=[:blue :red :green], group=iris.Species,xlabel = Names[3],ylabel=Names[4])
plot(p1,p2)

My PCA function

Code
using LinearAlgebra, Statistics
function my_PCA(X::Matrix{<:Real};normed=false)::Tuple{Vector{<:Real},Matrix{<:Real},Matrix{<:Real},Matrix{<:Real},Real,Vector{<:Real},Matrix{<:Real}}
"""
    Compute the PCA of Data
    Input
    X : (n,p) Matrix of reals
         n = number of observations
         p = number of variables
    Output
        Λ : Vector of the p eigenvalues in decreasing order
        U : (p,p) Matrix of reals
            eigenvectors in columns
        Ψ : (n,p) Matrix of reals
            Coordinates of the observations in the new basis
        Φ : (p,p) Matrix of reals
             Coordinates of the variables in the new basis
        I_total : Real
             total inertia
        cum_var_ratio : p vector of reals
             cumulative variance ratio
"""
     n,p = size(X)
     Λ = zeros(p); U = zeros(p,p); Ψ = zeros(n,p); Φ = zeros(p,p); I_total=0; cum_var_ratio = zeros(p)
     # Calculation of centered data
     xbar = mean(X,dims=1)
     Xc = X - ones(n,1)*xbar
     covMat = (1/n)*Xc'*Xc
     if normed == true
         s = std(Xc, corrected=false, dims=1)
         Y = Xc ./ (ones(n,1)*s)
         covMat = (1/n)*Y'*Y
     end


     # Computing the total inertia
     I_total = tr(covMat)
     Λ, U = eigen(covMat)
     eigOrder = sortperm(Λ, rev = true) # to sort the eigenvalues in decreasing order
     Λ = Λ[eigOrder]
     # cumulative variance ratios
     cum_var_ratio =  Vector{Float64}(undef,p)
     for i in 1:p
         cum_var_ratio[i] = sum(Λ[1:i])/I_total
     end
     U = U[:,eigOrder]
     if normed == true
       Ψ = Y*U
       Φ = U*sqrt.(diagm(Λ))
     else
       Ψ = Xc*U
       Φ = U*sqrt.(n*diagm(Λ))
     end
     return Λ, U, Ψ, Φ, I_total,cum_var_ratio, covMat
end
my_PCA (generic function with 1 method)
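
The plots that follow use my_Ψ, my_Φ and my_U for the iris data, so my_PCA must first be applied to the matrix X built above; a minimal call (the name my_covMat for the last returned value is just a placeholder):

Code
my_Λ, my_U, my_Ψ, my_Φ, my_I_total, my_cum_var_ratio, my_covMat = my_PCA(X)  # non-normed PCA of the iris data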

Graph of the observations

Code
p3 = scatter(my_Ψ[:,1],my_Ψ[:,2], c=[:blue :red :green], group=iris.Species,xlabel = "PC1", ylabel="PC2")
p4 = scatter(my_Ψ[:,3],my_Ψ[:,4], c=[:blue :red :green], group=iris.Species,xlabel = "PC3", ylabel="PC4")
plot(p3,p4)

Graph of the variables

Code
pvar1 = plot()
ech = 1.1*maximum(abs.(my_Φ))
for i=1:4
    plot!(pvar1,[0,my_Φ[i,1]], [0,my_Φ[i,2]], xlims=(-ech,ech), ylims=(-ech,ech), arrow=true, label=Names[i], legend=:bottomleft, xlabel="v1", ylabel="v2")
end

pvar2 = plot()
for i=1:4
    plot!(pvar2,[0,my_Φ[i,3]], [0,my_Φ[i,4]], xlims=(-ech,ech), ylims=(-ech,ech), arrow=true, label=Names[i], legend=:bottomleft, xlabel="v3", ylabel="v4")
end
plot(pvar1,pvar2)

With the MultivariateStats package

Code
using MultivariateStats

model = fit(PCA, X', maxoutdim=4, pratio = 0.999)  # Each column of X' is an observation
U = projection(model)
println("U = ")
display(U)
Ψ = MultivariateStats.transform(model, X')
println("Ψ = ")
display(Ψ)
display(Ψ' - my_Ψ) # Each column of Ψ is an observation
display(U-my_U)
U = 
4×4 Matrix{Float64}:
 -0.361387    0.656589   0.58203     0.315487
  0.0845225   0.730161  -0.597911   -0.319723
 -0.856671   -0.173373  -0.0762361  -0.479839
 -0.358289   -0.075481  -0.545831    0.753657
Ψ = 
4×150 Matrix{Float64}:
 2.68413      2.71414     2.88899    …  -1.76435    -1.90094    -1.39019
 0.319397    -0.177001   -0.144949       0.0788589   0.116628   -0.282661
 0.0279148    0.210464   -0.0179003     -0.130482   -0.723252   -0.36291
 0.00226244   0.0990266   0.0199684      0.137001    0.0445953  -0.155039
150×4 Matrix{Float64}:
 -4.44089e-16  -1.66533e-16  -3.29597e-16   2.23346e-16
 -4.44089e-16  -5.55112e-17   2.22045e-16  -3.33067e-16
 -4.44089e-16  -1.38778e-16  -1.07553e-16   5.58581e-16
 -4.44089e-16  -2.77556e-16  -4.3715e-16    6.66134e-16
 -4.44089e-16  -2.22045e-16  -5.55112e-16   7.77156e-16
 -4.44089e-16  -1.11022e-16  -6.38378e-16   6.45317e-16
 -4.44089e-16  -2.22045e-16  -4.44089e-16   1.54043e-15
  0.0          -1.66533e-16  -4.4062e-16    2.22045e-16
 -4.44089e-16  -2.22045e-16  -2.18575e-16   7.80626e-16
 -4.44089e-16  -1.94289e-16  -3.33067e-16  -3.26128e-16
 -4.44089e-16   0.0          -4.44089e-16  -2.2031e-16
  0.0          -1.8735e-16   -9.99201e-16   7.77156e-16
 -4.44089e-16  -5.55112e-17  -3.33067e-16  -3.20924e-16
  ⋮                                        
  2.22045e-16  -2.77556e-17   2.77556e-16   9.88792e-16
  4.44089e-16   1.11022e-16   1.05818e-15  -3.60822e-16
  0.0          -8.32667e-17   1.38778e-15   8.88178e-16
  0.0           1.66533e-16   2.28983e-15   0.0
  2.22045e-16  -1.11022e-16   2.77556e-16   1.14492e-15
  4.44089e-16   5.55112e-17   3.33067e-16   6.52256e-16
  0.0           5.55112e-17   1.22125e-15   1.52656e-15
  2.22045e-16   1.66533e-16   1.88738e-15   3.33067e-16
  2.22045e-16   2.22045e-16   1.33227e-15  -4.996e-16
  0.0           1.249e-16     7.77156e-16   2.498e-16
  2.22045e-16  -1.11022e-16   4.44089e-16   2.53964e-15
  2.22045e-16  -1.11022e-16  -3.88578e-16   1.38778e-15
4×4 Matrix{Float64}:
  5.55112e-17   3.33067e-16   1.11022e-15  -2.66454e-15
  1.38778e-17  -3.33067e-16  -1.55431e-15   2.16493e-15
  3.33067e-16  -2.22045e-16  -1.91513e-15   2.77556e-16
 -3.88578e-16   2.22045e-16   3.33067e-15   2.10942e-15
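
As a complement (not in the original listing), the fitted PCA model can also be used to measure the variance captured by each component and to map the scores back to the original space; a minimal sketch using MultivariateStats' principalvars and reconstruct (the exact values depend on the covariance normalisation convention used by the package):

Code
println("principal variances = ", principalvars(model))
Xrec = reconstruct(model, Ψ)          # back to the original 4-dimensional space
println(maximum(abs.(Xrec - X')))     # ≈ 0 here, since all four components were kept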

Normed PCA

Introduction

  • \(X\) is the \((n,p)\) data matrix

  • \(X_c\) is the centered matrix

  • \(Y\) is the centered and normed matrix: each column of \(X_c\) is divided by its sample standard deviation

  • \(R=\frac{1}{n}Y^TY\) is the correlation matrix of the data \(X\)

  • \[R=U\texttt{diag}(\Lambda) U^T.\] Then we have

  • The coordinates of the \(n\) observations in the new basis of eigenvectors \((\vec{u}_1,\ldots,\vec{u}_p)\) are \[\Psi = YU\]

  • The coordinates of the \(p\) variables in the new basis \((\vec{v}_1,\ldots,\vec{v}_p)\) are \[\Phi = U\texttt{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_p})\]

  • The total inertia (variance) is \[I = \texttt{trace}(R)=\sum_{i=1}^{p}\lambda_i = p\]

  • The variance of the variable \(v_i\) is \(\lambda_i\) (these relations are illustrated in the sketch below)
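
A minimal sketch (on small synthetic data, only to illustrate the normed setting) showing that the matrix diagonalised here is the usual Pearson correlation matrix, as returned by Statistics.cor, and that its trace equals \(p\):

Code
using Statistics, LinearAlgebra
let n = 50, p = 4
    X  = randn(n, p) .* [1.0 10.0 0.1 5.0]        # synthetic columns with very different scales
    Xc = X .- mean(X, dims=1)
    Y  = Xc ./ std(Xc, corrected=false, dims=1)   # centered and normed data
    R  = (1/n) * Y' * Y
    @assert R ≈ cor(X)                            # R is the correlation matrix of X
    @assert tr(R) ≈ p                             # total inertia of a normed PCA is p
end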

Data

Code
# Data from Tomassone page 138 : mineral waters
X =[341  27   3   84  23   2
    263  23   9   91   5   3
    287   3   5   44  24  23
    298   9  23   96   6  11
    200  15   8   70   2   4
    250   5  20   71   6  11
    357  10   2   78  24   5
    311  14  18   73  18  13
    256   6  23   86   3  18
    186  10  16   64   4   9
    183  16  44   48  11  31
    398 218  15  157  35   8
    348  51  31  140   4  14
    168  24   8   55   5   9
    110  65   5    4   1   3
    332  14   8  103  16   5
    196  18   6   58   6  13
     59   7   6   16   2   9
    402 306  15  202  36   3
     64   7   8   10   6   8 ]

df = DataFrame(X,[:HCO3, :SO4, :Cl, :Ca, :Mg, :Na])
df[:, :Origins] = ["Aix-les-Bains", "Beckerish",
"Cayranne", 
"Chambon",
"Cristal-Roc",
"St Cyr",
"Evian",
"Ferita",
"St Hyppolite",
"Laurier", 
"Ogeu",
"Ondine",
"Perrier",
"Ribes", 
"Spa",
"Thonon", 
"Veri", 
"Viladreau",
"Vittel", 
"Volvic"]
20-element Vector{String}:
 "Aix-les-Bains"
 "Beckerish"
 "Cayranne"
 "Chambon"
 "Cristal-Roc"
 "St Cyr"
 "Evian"
 "Ferita"
 "St Hyppolite"
 "Laurier"
 "Ogeu"
 "Ondine"
 "Perrier"
 "Ribes"
 "Spa"
 "Thonon"
 "Veri"
 "Viladreau"
 "Vittel"
 "Volvic"

Graph of the observations

Code
my_PCA_results = my_PCA(X,normed=true)
my_Λ, my_U, my_Ψ, my_Φ, my_I_total, my_cum_var_ratio, my_R = my_PCA_results
p3 = scatter(my_Ψ[:,1],my_Ψ[:,2],xlabel = "PC1", ylabel="PC2")
p4 = scatter(my_Ψ[:,3],my_Ψ[:,4], xlabel = "PC3", ylabel="PC4")
plot(p3,p4)

println(sum(my_Φ.^2,dims=1))
println(sum(my_Φ.^2,dims=2))
println("my_Φ*my_Φ' = ")
display(my_Φ*my_Φ')
println("Matrix of correlation = ")
display(my_R)
println("my_Φ'*my_Φ = ")
display(my_Φ'*my_Φ)
[3.0940874688062925 1.6875603215911315 0.5965131904043633 0.5028441637469864 0.09323922489032904 0.025755630560899054]
[0.9999999999999994; 0.9999999999999997; 1.0000000000000007; 0.9999999999999999; 1.0000000000000007; 1.0000000000000018;;]
my_Φ*my_Φ' = 
6×6 Matrix{Float64}:
  1.0        0.47796     0.121776    0.851749   0.730575   -0.108825
  0.47796    1.0         0.0449271   0.732646   0.670631   -0.278783
  0.121776   0.0449271   1.0         0.252035  -0.125463    0.668014
  0.851749   0.732646    0.252035    1.0        0.605682   -0.196195
  0.730575   0.670631   -0.125463    0.605682   1.0        -0.0905507
 -0.108825  -0.278783    0.668014   -0.196195  -0.0905507   1.0
Matrix of correlation = 
6×6 Matrix{Float64}:
  1.0        0.47796     0.121776    0.851749   0.730575   -0.108825
  0.47796    1.0         0.0449271   0.732646   0.670631   -0.278783
  0.121776   0.0449271   1.0         0.252035  -0.125463    0.668014
  0.851749   0.732646    0.252035    1.0        0.605682   -0.196195
  0.730575   0.670631   -0.125463    0.605682   1.0        -0.0905507
 -0.108825  -0.278783    0.668014   -0.196195  -0.0905507   1.0
my_Φ'*my_Φ = 
6×6 Matrix{Float64}:
  3.09409       3.94656e-16   1.3417e-16   …   9.19712e-17   5.2522e-17
  3.94656e-16   1.68756       5.89447e-16      8.47945e-17  -7.23998e-18
  1.3417e-16    5.89447e-16   0.596513        -2.69255e-17  -6.3729e-18
 -1.54354e-16   3.39583e-16   3.4386e-16      -1.29052e-17   8.19281e-18
  9.19712e-17   8.47945e-17  -2.69255e-17      0.0932392     4.41758e-18
  5.2522e-17   -7.23998e-18  -6.3729e-18   …   4.41758e-18   0.0257556
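
These numerical checks follow directly from the definition \(\Phi = U\texttt{diag}(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_p})\) and the orthogonality of \(U\): \[\Phi\Phi^T = U\texttt{diag}(\Lambda)U^T = R, \qquad \Phi^T\Phi = \texttt{diag}(\Lambda),\] so each row of \(\Phi\) has squared norm \(R_{ii}=1\) and column \(i\) has squared norm \(\lambda_i\), which is exactly what the two sums printed above show.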

Graph of the variables

Code
n,p = size(X)
println("p = ", p)
Names = names(df)
println(Names)
pvar1 = plot()
ech = 1.1*maximum(abs.(my_Φ))
for i=1:p
    plot!(pvar1,[0,my_Φ[i,1]], [0,my_Φ[i,2]], xlims=(-ech,ech), ylims=(-ech,ech), arrow=true, label=Names[i], legend=:bottomleft, xlabel="v1", ylabel="v2")
end

# Plot the unit circle
cercle(θ)=[cos.(θ) sin.(θ)]
Inter_theta = 0:0.01:2*π
Cercle = cercle(Inter_theta)
plot!(pvar1,Cercle[:,1],Cercle[:,2])

pvar2 = plot()
for i=1:p
    plot!(pvar2,[0,my_Φ[i,3]], [0,my_Φ[i,4]], xlims=(-ech,ech), ylims=(-ech,ech), arrow=true, label=Names[i], legend=:bottomleft, xlabel="v3", ylabel="v4")
end
plot!(pvar2,Cercle[:,1],Cercle[:,2])
plot(pvar1,pvar2,aspect_ratio=:equal)
p = 6
["HCO3", "SO4", "Cl", "Ca", "Mg", "Na", "Origins"]
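
The unit circle drawn on these plots has a precise meaning for a normed PCA: the coordinate \(\Phi_{ij}\) is the correlation between the \(i\)-th standardised variable and the \(j\)-th principal component. Indeed \[\texttt{cov}(Ye_i,\,Yu_j)=\tfrac{1}{n}(Ye_i)^T(Yu_j)=e_i^TRu_j=\lambda_j u_{ij}, \qquad \texttt{cor}(Ye_i,\,Yu_j)=\frac{\lambda_j u_{ij}}{1\cdot\sqrt{\lambda_j}}=\sqrt{\lambda_j}\,u_{ij}=\Phi_{ij},\] and \(\sum_j \Phi_{ij}^2 = R_{ii} = 1\), so every arrow lies inside the unit circle; variables drawn close to the circle are well represented in the corresponding plane.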

Color image to the best black-and-white image

Data

Code
using Plots
using TestImages, Images
img = testimage("lighthouse")
imgarr = channelview(img)
plot(img)
println(imgarr[1,1:3,1:3])
img_red = StackedView(imgarr[1,:,:], zeroarray, zeroarray)
img2_red = colorview(RGB, img_red)
img_green = StackedView(zeroarray,imgarr[2,:,:], zeroarray)
img2_green = colorview(RGB, img_green)
img_blue = StackedView(zeroarray, zeroarray,imgarr[3,:,:])
img2_blue = colorview(RGB, img_blue)
#permutedims(imgarr, [2,3,1])
mosaicview(img,img2_red, img2_green, img2_blue; nrow = 2)
N0f8[0.361N0f8 0.349N0f8 0.345N0f8; 0.361N0f8 0.361N0f8 0.361N0f8; 0.376N0f8 0.353N0f8 0.361N0f8]

PCA

Code
X1=imgarr[1,:,:];X2=imgarr[2,:,:];X3=imgarr[3,:,:];
nl,nc=size(X1);
println(nl," ", nc)
#
# X is the matrix for the PCA
X=[X1[:] X2[:] X3[:]];
println(size(X))
Λ, U, Ψ, Φ, I_total, cum_var_ratio, cov_mat = my_PCA(X)
println("nl = ", nl)
println("nc = ", nc)
println(size(Ψ))
Ψ1 = zeros(nl,nc)
Ψ2 = zeros(nl,nc)
Ψ3 = zeros(nl,nc)
println(size(Ψ1))
I=1:nl;
for j in 1:nc
   Ψ1[I,j] = Ψ[(j-1)*nl .+ I,1];
   Ψ2[I,j] = Ψ[(j-1)*nl .+ I,2];
   Ψ3[I,j] = Ψ[(j-1)*nl .+ I,3];
end
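
# Note: this loop is equivalent to reshaping each score column back to the image
# shape, e.g. Ψ1 == reshape(Ψ[:,1], nl, nc), since X1[:] flattens the image in
# column-major order.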


minΨ1 = minimum(Ψ1)
maxΨ1 = maximum(Ψ1)
PSI1 = (Ψ1 .- minΨ1) ./ (maxΨ1 - minΨ1)   # rescale the first component to [0,1] for display

using Colors
Gray.(PSI1)
#@view PSI1
512 768
(393216, 3)
nl = 512
nc = 768
(393216, 3)
(512, 768)
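
To quantify how much of the variance of the three colour channels the first component keeps (and hence how faithful this black-and-white image is), one can inspect the quantities already returned by my_PCA; a minimal sketch (output not reproduced here):

Code
println("cumulative variance ratios = ", round.(cum_var_ratio, digits=4))
println("share of the total inertia on PC1 = ", round(Λ[1]/I_total, digits=4))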