How to Grok Principal Component Analysis: Enormous knowledge gives us superpowers.

John Cousins
February 7, 2023
10 min read

The inside fold of the 1970 Mad Dogs & Englishmen live double album by Joe Cocker describes Leon Russell, the music director of the tour, as “Master of Space and Time.” That image made a big impression on me as a young teen.

Manipulating data makes me feel like a master of space and time. Working with time series data and multi-dimensional data frames is working fluidly with space and time.

Dimensional reduction is the ultimate manipulation of space and time.

You know that quote from Ecclesiastes that there is nothing new under the sun? Machine learning and AI are truly something new under the sun.

Those algorithms, along with PCs, cloud computing, and access to enormous troves of data, give us immense newfound powers. We are all just a surmountable training distance away from superpowers.

Grok is a word coined by the science fiction writer Robert A. Heinlein in his 1961 novel Stranger in a Strange Land. It means to understand something intuitively and to establish rapport.

Sometimes the obstacles seem overwhelming. In this piece I want to share an insight I had into one part of the pipeline process: Principal Component Analysis (PCA).

PCA is a crucial step in machine learning, AI, and statistical analysis pipelines.

PCA can be a challenging subject to get your head around. Here I describe the light bulb moment when it clicked for me.

This article isn’t that technical of an overview; it’s conceptual. Once you have the contextual knowledge, the technical details make more sense.

Here I use quantitative finance as an application to describe PCA. Principal component analysis and dimensionality reduction are fundamental in quantitative finance.

PCA is critical in many other fields and domains where you are trying to use known data to predict an outcome. I hope this overview is helpful to you.

Some factors explain an outcome, and we seek to find them. We model these factors as rows in a spreadsheet-style matrix. To manipulate them, we use the mathematics of vectors and matrices of Linear Algebra.

In Python, we store these data sets as DataFrames and Series using Pandas. We use the matrix math of linear algebra to manipulate the Pandas DataFrames.
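For example, here is a minimal sketch of what such a data set might look like in Pandas. The factor names and values are made up purely for illustration; each factor is a column here, though you could equally store factors as rows and transpose.

    import pandas as pd

    # Hypothetical daily factor data: each column is a candidate factor (feature),
    # each row is an observation indexed by date.
    dates = pd.date_range("2023-01-02", periods=4, freq="B")
    factors = pd.DataFrame(
        {
            "momentum": [0.4, 0.1, -0.2, 0.3],
            "value": [1.2, 1.1, 1.3, 1.0],
            "volatility": [0.05, 0.07, 0.06, 0.08],
        },
        index=dates,
    )
    print(factors.shape)  # (4, 3): 4 observations, 3 factors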

The size of these computer files can get large and cumbersome to process.

To make them more manageable, we reduce the size. We use PCA to analyze which vectors in the matrix contribute most to the outcome we are modeling.

Our goal is to predict future outcomes from past data. We are looking for signals in the noise. We seek information from the past that has predictive power about the future. We want to extract that information and rank it for its predictive power.


In quantitative finance, we want to predict the future price movements of stocks and make investment bets based on our factors.

We call these factors alpha factors.

We reduce the size of the data set by eliminating factor vectors that contribute little to predictive power.

Each row of data is a vector in a multidimensional linear algebra space. Some vectors add value, and some are redundant or unimportant. Our goal is dimensional reduction.

PCA is the tool we use to perform the dimensional reduction. We feed in the data set and the amount of predictive power we want to retain. The PCA algorithm crunches the numbers. It delivers a set of the most potent vectors ranked by their contribution.
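With Scikit-learn (more on it near the end), a minimal sketch of that workflow might look like this. The synthetic data and the 95% retained-variance target are placeholder assumptions, not part of any real pipeline.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    drivers = rng.normal(size=(500, 5))        # 5 hidden drivers
    X = drivers @ rng.normal(size=(5, 20))     # observed as 20 correlated factors
    X += 0.1 * rng.normal(size=X.shape)        # plus a little noise

    pca = PCA(n_components=0.95)               # keep enough components for 95% of the variance
    X_reduced = pca.fit_transform(X)           # components come back ranked by contribution

    print(X_reduced.shape)                     # (500, k) with k much smaller than 20
    print(pca.explained_variance_ratio_)       # each retained component's share of the variance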

I first grasped how PCA worked when I thought of it like linear regression. Linear regression takes two variables, an independent and a dependent variable. It plots a line that fits most closely between all the data points. That is a lot of iterative number crunching, but it is easy for a computer to do fast.

Linear Regression line fitting.

PCA works similarly. It plots the first two variables against each other and fits a line through the points.

This PCA line has different attributes than the linear regression line. Instead of minimizing the vertical distance from each point to the line, PCA projects the points onto the line and maximizes how spread out those projections are along it. That spread is the variance.

We use a scatter plot to plot all the points and get a visual to estimate how the line might slope.

The line has a slope, like in linear regression.

This new line then becomes the X-axis of a new coordinate system. PCA then finds the next direction of greatest remaining spread, constrained to be orthogonal (perpendicular) to the newly calculated X-axis. This second component becomes the new Y-axis. PCA runs the same optimization and projects the points onto the Y-axis.

PCA line fitting from scatter plot.

That was the “aha” moment for me. I realized it was a rotation based on creating a line through the data like linear regression.

The rotation of the X and Y-axis represents a transformed coordinate system of the space. This transformation represents a set of numbers that translates the x and y components of the original space into this transformed space. This transformed space is optimized and tailored for this particular data set.

The points in the scatter plot are still in the same place. But the numbers we use to identify their location now need to be represented in the new coordinate system.

The vectors that define the new axes, and the numbers that measure how much variance lies along each of them, are called eigenvectors and eigenvalues.

Before we can run PCA, the data needs to be normalized. Here, normalization means subtracting the mean of each variable so that the data is centered around the origin (0,0). That matters because the transformation defined by the eigenvectors keeps the origin fixed.
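A minimal sketch of that centering step follows. In practice you would often also divide by each column's standard deviation, for example with Scikit-learn's StandardScaler.

    import numpy as np

    X = np.array([[2.0, 10.0],
                  [4.0, 14.0],
                  [6.0, 18.0]])

    X_centered = X - X.mean(axis=0)   # subtract each variable's mean
    print(X_centered.mean(axis=0))    # ~[0. 0.]: the cloud now sits around the origin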

A fantastic series of videos to help you get a grasp of linear algebra and eigenvectors is 3Blue1Brown's Essence of Linear Algebra.

I highly recommend watching the series in order.

PCA chugs through all the data vectors and plots them as additional dimensions in this multidimensional space. PCA spits out the ranking and how many vectors contribute. You can then calibrate how many to use as you proceed with factor analysis.

Let’s get into a little more detail and explore what the rotation represents conceptually and what that does for the data scientist performing PCA.

Methods for automatically reducing the number of columns of a dataset are called dimensionality reduction. One of the most popular is principal component analysis, or PCA for short.

PCA in data science is a processing step in a pipeline. We want to reduce the complexity of our data set in preparation for procedures we want to perform later in the pipeline.

PCA is a dimensionality reduction technique that can map high dimensional vectors to a lower-dimensional space.

We want to reduce complexity while retaining the informational meaning. In the case of PCA, the information is in the form of variance: how much the data varies. Reducing complexity also reduces vulnerability to aggregated error.

A dataset can have hundreds, thousands, or more columns and rows. Each column or row can be a feature. Some features are more relevant than others in their predictive power vis-à-vis a target.

A PCA-processed matrix can also be part of a constraint in an optimization.

Models built from data that include irrelevant features are cumbersome and vulnerable to accumulated errors. These models are also susceptible to overfitting the training data. They are poorly constructed compared to models trained from the most relevant data.

The number of quantities to calculate in a matrix grows with the square of the number of features it covers.

For example, a matrix of dimension 3000 has about 4.5 million elements to calculate and track. A matrix of dimension 70 has about 2.5 thousand quantities to estimate. That is a big difference in scale, with many fewer opportunities to introduce estimation error.
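One way to see where those numbers come from: an n-by-n covariance matrix has n(n + 1)/2 unique entries (the diagonal plus one triangle of the symmetric matrix). For n = 3000 that is 3000 × 3001 / 2 ≈ 4.5 million; for n = 70 it is 70 × 71 / 2 = 2,485.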

PCA is an optimization process that reduces the data set, model, and training complexity.

The trick is to figure out which features of the data are relevant and which are not.

Principal Component Analysis compresses data into a format that retains the data’s core information. That core info is the variance of the original data across its columns or rows.

PCA is a series of calculations that give us a new and unique basis for a data set.

So why is it unique?

PCA calculates new dimensions and ranks them by their variance content.

The first dimension is the dimension along which the data points are the most spread out. The data points have the most variance along this dimension.

And what does that mean exactly?

PCA creates a new axis in the 2D plane. PCA calculates the coordinates of our data points along this new axis. It does this by projecting them by the shortest path to the new axis.

PCA chooses the new axis in such a way that the new coordinates are spread out as much as possible. They have maximum variance. The line along which the projected coordinates are most spread out is also the line that minimizes the perpendicular distance from each point to the new axis.

The basis minimizes reconstruction error. Maximizing variance and minimizing reconstruction error go hand in hand.

The relationship between those two views comes from the Pythagorean theorem.

The squared distance from the origin to the projection, plus the squared distance from the projection to the point, equals the squared distance from the origin to the point.
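In symbols, for a centered data point p and its projection p̂ onto the new axis:

    ‖p‖² = ‖p̂‖² + ‖p − p̂‖²

Summing this over all the points, the left side is fixed by the data. So making the projected coordinates ‖p̂‖² as spread out as possible (maximum variance) is the same thing as making the residuals ‖p − p̂‖² as small as possible (minimum reconstruction error).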

Extraction

PCA extracts the patterns represented by the variance in the data and performs dimensionality reduction. The core of the PCA method is a matrix factorization method from linear algebra called eigendecomposition.
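Here is a bare-bones sketch of that step with NumPy, assuming the data has already been centered. The synthetic data is just for illustration; in practice you would usually let Scikit-learn handle this.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    X = X - X.mean(axis=0)                            # center the data first

    cov = np.cov(X, rowvar=False)                     # 3 x 3 covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices

    order = np.argsort(eigenvalues)[::-1]             # rank components by variance, largest first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    X_transformed = X @ eigenvectors                  # coordinates in the new (rotated) basis
    print(eigenvalues)                                # variance captured along each new axis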

Say we have a dataset with 1000 columns. Another way to say it is our dataset has 1000 dimensions. Do we need so many dimensions to capture the variance of the dataset? Most times, we don’t.

We need a fast and easy way to remove features that don’t contribute to the variance. With PCA, we can capture the essence of the data of 1000 dimensions in a much lower number of transformed dimensions.

Variance

Each of the 1000 features, represented by the columns or rows, contains a certain amount of variance. Some of the values are higher and some lower than the average.

Features that don't vary, that stay essentially the same over time, provide no insight or predictive power.

The more variance a feature contains, the more important that feature is. The feature carries more ‘information’. Variance states how much the value of a particular feature varies throughout the data. PCA ranks features by their amount of variance.

Principal Components

Now that we know the variance, we need to find a transformed feature set that can explain the variance more efficiently. PCA uses the original 1000 features to make linear combinations that extract the variance into new features. These transformed features are the Principal Components (PCs).

The principal components are not the original features themselves; each one is a linear combination of them. The transformed feature set, or principal components, has the most significant variance explained in the first PC. The second PC will have the second-highest variance, and so on.

PCA helps you understand whether a small number of components can explain a large portion of the variation across all the data observations.

Say, for example, the first PC explains 60% of the total variance in the data, the second PC explains 25%, and the next four PCs contain 13% of the variance. In this case, you have 98% of the variance explained by only 6 principal components.

Say the next 100 features in total explain another 1% of the total variance. It makes no sense to include 100 more dimensions to get an extra one percent of variance. By taking the top 6 principal components, we have reduced the dimensionality from more than 100 features to just 6.

PCA ranks principal components in the order of their explained variance. We can select the top components to explain a variance ratio of sufficient value. You can pass that threshold as an input when you set up PCA.
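For example, with Scikit-learn you can either inspect the ranked ratios yourself or pass the threshold directly. The 98% target and the synthetic data here are just illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 50))   # 50 correlated features

    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)        # components are ranked, so this climbs toward 1
    n_keep = int(np.searchsorted(cumulative, 0.98)) + 1          # smallest number of PCs reaching 98%

    # Or let PCA choose the count from the threshold directly:
    pca_98 = PCA(n_components=0.98).fit(X)
    print(n_keep, pca_98.n_components_)                          # typically the same number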

PCA provides insight into how variance is distributed through your data set. PCA creates a reduced data set that is easier to handle in matrix math calculations for optimization. PCA is used in quantitative finance in building risk factor models.

Reducing the dimensionality reduces the effect of error terms that can aggregate up. PCA also addresses over-fitting by eliminating superfluous features. If your model is over-fitting, it will work well on the training data but perform poorly on new data. PCA helps address this.

Conclusion

PCA is a feature extraction technique. It is an integral part of feature engineering.

We are looking to create better models with more predictive power, but the map is not the territory. We are creating an abstraction that loses some of the original fidelity. The trick is to make the PC matrix as simple as possible, but no simpler.

Prediction is an ideal that we strive to asymptotically approach. We can’t reach perfection. If we strive for perfection, we can attain excellence.

The Python data-science ecosystem has PCA built in.

A great feature of Python, if you needed more convincing that it's a powerful language for data science, is that it has an easy-to-use PCA implementation in Scikit-learn. Scikit-learn is a free machine-learning library for Python. Import PCA from Scikit-learn and try it out!
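Here is a minimal end-to-end sketch, using a synthetic DataFrame just so there is something to run.

    import numpy as np
    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    df = pd.DataFrame(rng.normal(size=(100, 8)),
                      columns=[f"factor_{i}" for i in range(8)])

    # Standardize each column, then keep the top 3 principal components.
    pipeline = make_pipeline(StandardScaler(), PCA(n_components=3))
    components = pipeline.fit_transform(df)

    print(components.shape)                                      # (100, 3)
    print(pipeline.named_steps["pca"].explained_variance_ratio_)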

PCA is a crucial tool and component in machine learning. I hope this helps make it more accessible as part of your toolkit.

