A deeper dive into PCA

Attached is a very rough draft of some ideas I’ve been playing with since yesterday’s linear algebra meeting. (Note to self: look into using {conflr} to post Rmd output directly to Confluence).

In short, it would seem that the entire PCA problem of mapping data from our original data space (generally, after standardizing) to “principal component space” can be summarized in a single, very simple equation:

Y=XV

where X is the standardized data matrix (with dimensions n×p), V is a matrix of eigenvectors (dimensions p×q, where q is the number of principal components kept, such that q ≤ p), and Y is the matrix of principal components (n×q). This reinforces the idea that our principal components are simply linear combinations of our (standardized) data, with the weights given by the eigenvectors.
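As a quick sanity check of that equation (using the built-in iris measurements as stand-in data, not anything from our project), the scores returned by prcomp() should match X %*% V up to the usual sign ambiguity of eigenvectors:

```r
X <- scale(iris[, 1:4])        # standardized data matrix (n x p)
V <- eigen(cov(X))$vectors     # eigenvector matrix (p x q; here q = p)
Y <- X %*% V                   # principal components (n x q)

pca <- prcomp(X)
# scores agree up to a sign flip of individual columns; should be TRUE
all.equal(abs(unname(Y)), abs(unname(pca$x)))
```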

In this context, sdev from prcomp() is just telling us how much the unit vector along each new axis (i.e., each direction in principal component space) has been stretched, or rather the square root of that value, since the actual eigenvalues are variance terms (because we use a covariance matrix for the underlying problem). Put differently, sdev is the standard deviation of the data along each principal component axis.
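A similar quick check (iris as a stand-in again) that sdev squared recovers the eigenvalues of the covariance matrix:

```r
X   <- scale(iris[, 1:4])
pca <- prcomp(X)
# sdev^2 should match the eigenvalues of cov(X); should be TRUE
all.equal(pca$sdev^2, eigen(cov(X))$values)
```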


Below is a little animation I made that highlights the underlying rotation occurring in a PCA, using a 2D data set that has been standardized (i.e., centered and scaled). The first few seconds of the animation show the data in their original (standardized) space, and the last few seconds show the data in their principal component space (highlighted by the axis names). To make the animation a little more entertaining I’ve added a damped sine wave to the rotation.
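For anyone curious, here is a rough sketch of how the rotation part of such an animation could be put together with {gganimate}. This is not the original animation code: the faithful data set, the frame count, and the damping/wobble constants are all made-up illustration values, and it only shows the rotating points (no axis relabelling).

```r
library(ggplot2)
library(gganimate)

X     <- scale(faithful)              # stand-in 2D data, standardized
V     <- eigen(cov(X))$vectors        # target basis (eigenvectors)
theta <- atan2(V[2, 1], V[1, 1])      # angle of PC1 in the original basis

# rotation angle per frame: ramp from 0 to theta, plus a damped sine wave
t      <- seq(0, 1, length.out = 80)
wobble <- 0.2 * exp(-3 * t) * sin(10 * pi * t)
angle  <- (t + wobble) * theta

frames <- do.call(rbind, lapply(seq_along(angle), function(i) {
  a  <- angle[i]
  R  <- cbind(c(cos(a), sin(a)), c(-sin(a), cos(a)))  # equals V (up to sign) when a = theta
  Xr <- X %*% R
  data.frame(x = Xr[, 1], y = Xr[, 2], frame = i)
}))

anim <- ggplot(frames, aes(x, y)) +
  geom_point(alpha = 0.6) +
  coord_equal() +
  transition_manual(frame)

animate(anim, fps = 20)
```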

I’m going to keep working on this; here are a few things I’d like to add to the animation (my wish list):

  1. Steps that take the original data and center them

    1. Could use linear interpolation (approx()) and {gganimate}'s view_follow() function, combined with scale(…, center = TRUE).

  2. Steps that interpolate the centered data to their standardized equivalent (again, view_follow() will be useful).

  3. Some labels highlighting the amount of variance along each axis as the rotation progresses.

    1. This could be its own image, or possibly another facet. Adding a facet may be difficult, given the current capabilities of gganimate. I could “manually” stitch each time step together with {patchwork} (which is the best {ggplot} stitcher out there at the moment) via a for loop that uses {magick} to combine the images, but I’d rather not do that.

    2. The intent of this part of the animation is to reinforce the idea that the eigenvalues are just the variance/standard deviation along each PC axis, and that the variance along the second axis reaches its minimum exactly when the rotation reaches the (mutually orthogonal) principal component directions.

  4. A Euclidean best-fit line (the zero line on PC2) before the rotation occurs

  5. Lines (lighter in color) highlighting the Euclidean distance from the best-fit line to each individual point. I think I can accomplish these last two items because of the derivation I provide, such that X = YV⁻¹ (and, because V is orthonormal, V⁻¹ is just Vᵀ). That transformation will allow me to plot data in PC space, then rotate it back to the scaled data space. This repeated translation from one coordinate system to another is really a repeated change of basis, which, in this context, is itself just a rotation.
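As a quick numerical check of that round trip (iris as stand-in data once more):

```r
X      <- scale(iris[, 1:4])
V      <- eigen(cov(X))$vectors
Y      <- X %*% V                 # forward: standardized data -> PC space
X_back <- Y %*% solve(V)          # backward: PC space -> standardized data
# because V is orthonormal, solve(V) is the same as t(V); should be TRUE
all.equal(X_back, X, check.attributes = FALSE)
```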

Thinking of linear algebra as linear transformations between “spaces” has been super informative, by the way.