Thursday, November 30, 2017

Obama, Nobama? Using PCA Algorithm for Dimensionality Reduction of Images

I learned about the PCA (principal components analysis) algorithm from Andrew Ng's Stanford Machine Learning course, and I had to give the algorithm a test run on my own to see it in action.

PCA itself is not actually a machine learning algorithm - it's an algorithm for simplifying an input data set with many dimensions so that you can then feed it to an ML algorithm without blowing up your computer. Or, more simply put, you can use PCA to narrow down the number of input features you have when you suspect that many of them might be highly correlated with each other.

Where it gets interesting is that you can also use PCA for such high-dimensional objects as JPEG images, which may have tens of thousands of pixels. If you are feeding each pixel one-by-one into a machine learning algorithm for purposes of image recognition, computer vision, or whatever it may be, the computation time is going to be huge.

PCA is a mathematical method for finding the best description of the correlations between variables, so that fewer input features can stand in for many. For instance, if temperature (T) and snowfall (S) are highly correlated, why not use a single input feature Z that represents both variables simultaneously (where Z is the line fit through the points in the S-T plane that minimizes the projection error) to reduce the algorithm's computation time?
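As a toy illustration, here is what that looks like in R with simulated temperature and snowfall data (the numbers are made up; only the technique is real):

    # Simulate two highly correlated variables: temperature (T) and snowfall (S)
    set.seed(42)
    temperature <- rnorm(200, mean = -5, sd = 10)
    snowfall    <- -0.8 * temperature + rnorm(200, sd = 3)   # colder -> more snow

    # Run PCA on the standardized variables
    pca <- prcomp(cbind(temperature, snowfall), center = TRUE, scale. = TRUE)

    # Proportion of variance captured by each principal component
    summary(pca)$importance["Proportion of Variance", ]

    # The first principal component is the single feature Z that stands in for both variables
    Z <- pca$x[, 1]
    head(Z)

Because the two simulated variables are so strongly correlated, the first component typically captures well over 90% of the variance, which is exactly why a single Z can replace the T-S pair.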

So PCA essentially decomposes a complex, high-dimensional data set with N dimensions down to a more manageable data set of K dimensions (where K is a number of our own choosing).

Source: Pennsylvania State University - Department of Statistics Online Programs

Without getting into the details of how this is done - it involves linear algebra and some thorny mathematical derivations - here is what I did to test out PCA using images from the Internet:

  • I collected 100 JPEG images of President Obama from Google Images. Each image was 100 x 100 pixels, which meant 10,000 pixels per image in total. However, my laptop lacked sufficient computation power to do the calculations necessary for dimensionality reduction via PCA with that many inputs - so I used Paint to condense each image down to 50 x 50 size, or 2,500 pixels each. This is the full collection of Obama images I gathered for my input matrix:


  • Next, I used the pixmap library in R to convert each of the images into a matrix of numbers, i.e. 2,500 pixel intensity values. With 100 total images, my input feature matrix was 100 x 2,500 in size. (A code sketch of the full pipeline follows this list.)
  • After I had my input matrix, I ran it through the PCA algorithm and kept the 40 principal components that account for the most variance - together they retain a full 95% of the variance in the input features across the 100 sample images. As it happens, the amount of variance retained when the N-dimensional input feature matrix is compressed down to K dimensions rises quickly at first and then levels off, roughly like a logarithmic curve, as shown below:

  • The graph above essentially means that, for purposes of numerically representing the pictures based on pixel-intensity variation, the 50 x 50 images can be adequately represented by only 100 dimensions instead of 2,500 - in fact, just 40 dimensions capture 95% of the variance. Think about what this means for a second: if a certain pixel has an intensity X, several of the surrounding pixels are highly correlated in intensity with X. So dimensionality reduction lets us represent the whole image numerically with only 100 dimensions instead of 2,500.
  • Now, I was interested in how much distortion the PCA approximation caused for each of the images. So I ran the reduction in reverse to approximate the original pixel values and printed the reconstructed images. This is similar to, say, splitting an average value Z into two equal parts to recover the original variables X and Y: there's going to be some distortion and loss of information, but I wanted to see visually how much.
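Roughly, the whole pipeline looks like the sketch below. This isn't my exact script: it assumes the jpeg package's readJPEG() for loading the files (rather than the pixmap route mentioned above), and the file names are placeholders.

    library(jpeg)   # assumption: readJPEG() used here for loading the JPEGs

    # Read 100 grayscale 50 x 50 JPEGs into a 100 x 2,500 matrix (file names are made up)
    files <- sprintf("obama_%03d.jpg", 1:100)
    X <- t(sapply(files, function(f) {
      img <- readJPEG(f)
      if (length(dim(img)) == 3) img <- img[, , 1]   # drop color channels if present
      as.vector(img)                                 # 50 * 50 = 2,500 pixel intensities
    }))

    # PCA on the pixel-intensity matrix
    pca <- prcomp(X, center = TRUE)

    # Cumulative variance retained as more components are kept
    cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    plot(cum_var, type = "l", xlab = "Number of components (K)", ylab = "Variance retained")
    K <- which(cum_var >= 0.95)[1]   # about 40 components for the images described above

    # Project onto the first K components, then map back to approximate the original pixels
    Z     <- pca$x[, 1:K]                      # 100 x K compressed representation
    X_hat <- Z %*% t(pca$rotation[, 1:K])      # back to 100 x 2,500
    X_hat <- sweep(X_hat, 2, pca$center, "+")  # undo the centering

    # View one image before and after the round trip
    par(mfrow = c(1, 2))
    image(matrix(X[1, ], 50, 50), col = gray.colors(256), main = "Original")
    image(matrix(X_hat[1, ], 50, 50), col = gray.colors(256), main = "PCA approximation")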
Here's an example of how one of the images looked before and after the PCA approximation - that is, after compressing 2,500 dimensions down to 40 and then linearly approximating the original 2,500 again:

Original (left) vs. approximation with 95% of the variance retained (right)

The loss of information makes an already blurry image (because it's only 50 x 50 pixels) even more unsightly, but guess what? For the purposes of a machine learning algorithm, the dimension-reduced version will more likely than not suffice.

Monday, October 9, 2017

Predicting Recessions Using Neural Networks


Real GDP Growth from 1947 to present
(source: FRED, https://fred.stlouisfed.org)


How would a specialized machine learning algorithm perform at predicting this quarter's real GDP growth, given the last 20 quarters' real GDP growth/decrease percentages as inputs?

Actually, there are already techniques for modeling and predicting a time series based on its own past values (so-called "autoregressive" models), like the ARIMA model discussed in the previous post. ARIMA uses a linear combination of past values of a time series plus a linear combination of the past errors of that time series (i.e. estimated minus actual values over the fitted portion) to predict future values.
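For comparison, fitting such a model in R takes only a couple of lines. This is just a sketch: the series below is simulated stand-in data rather than the real GDP series, and the (2, 0, 1) order is an arbitrary example.

    # Simulated stand-in for the quarterly real GDP growth series
    set.seed(1)
    gdp_growth <- arima.sim(model = list(ar = 0.5, ma = 0.3), n = 280)

    # The AR terms use past values of the series; the MA term uses past errors
    fit <- arima(gdp_growth, order = c(2, 0, 1))
    predict(fit, n.ahead = 1)$pred   # one-quarter-ahead forecast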

But I wanted to try a machine learning approach - specifically neural networks, because of their flexibility and adaptability in a supervised learning setting.
What is a neural network? (High-level overview)

It's a machine learning algorithm that's inspired by the workings of the biological brain. You have a number of inputs, which you feed to the learning algorithm. You give the algorithm a set of training data, i.e. numerous observations of inputs alongside the dependent variable output, or "Y" value that you wish to predict. 

The inputs feed into hidden "neurons". You can choose as many or as few neurons as you wish, and you can also change the number of hidden neuron "layers". Each neuron is "activated" or not according to a logistic function, also called a sigmoid curve, which takes on values between 0 and 1 depending on the input fed to that neuron. Essentially, the neural network you build is a mesh of activated and unactivated neurons that together can form a complex logical calculus leading to an output prediction. That is, you feed the neural network A, B, and C, for example. This activates neurons D and F, but not E or G. Neurons D and F then activate J and K, which in turn influence the prediction that the neural network outputs.

The algorithm "learns" by adapting the weights assigned to each of the connections between neurons. These weighted connections determine, for example, to what extent input A plays a role in activating neuron D, for example. A different weight in the same matrix determines how neuron D, if activated, will influence the final output. So the "learning" process happens by the algorithm optimizing the weights on all neural connections in such a way that error (or "cost", to use precise machine learning terms) will be minimized.

Here is a pictorial representation of an actual neural network implemented in R. It has one input layer with 5 inputs, two hidden layers with 3 neurons each, and an output layer with one variable.


Model Setup:

Now that all that introductory business is out of the way, here's how I actually set up the model for the experiment.

I first gathered the necessary time series data from FRED, the Federal Reserve Bank of St. Louis website: real GDP growth by quarter from 1947 Q2 to 2017 Q2. I set up a data frame in R containing one row for each quarter from Q2 of 1952 through Q2 of 2017 (starting in 1952 Q2 because it is the first quarter with 20 full preceding quarters of data to use as input variables). In each quarter's row, I included the lagged GDP growth values for the preceding 20 quarters - for example, the row for Q2 2017 includes the following variables:

V1 = Q1 2017 growth/decrease
V2 = Q4 2016 growth/decrease
V3 = Q3 2016 growth/decrease
...
V20 = Q2 2012 growth/decrease

So the neural network includes 20 inputs for each training example - one value for each of the preceding 20 quarters' GDP growth or decrease. The dependent variable is then the actual growth or decrease for that particular quarter.
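Here is roughly how that lagged data frame can be built in R. It's a sketch: the growth vector below is simulated stand-in data rather than the actual FRED series.

    # Stand-in for quarterly real GDP growth, 1947 Q2 through 2017 Q2 (281 quarters)
    set.seed(2)
    growth <- rnorm(281)

    # embed() turns the series into rows of [current quarter, lag 1, ..., lag 20]
    lagged <- embed(growth, 21)
    df <- as.data.frame(lagged)
    colnames(df) <- c("Y", paste0("V", 1:20))   # Y = this quarter, V1..V20 = the 20 prior quarters

    nrow(df)          # 261 usable quarters, i.e. 1952 Q2 onward
    head(df[, 1:5])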

Choosing Neural Network Structure by Cross-Validation

How many layers, and how many neurons in each layer, would let the algorithm do the best possible job of learning the pattern in the data and predicting the GDP growth or decrease in a given quarter? To be sure, more neurons and layers mean greater flexibility in fitting the training data, while fewer neurons mean less flexibility and more bias. But too much flexibility means high variance - in other words, overfitting to the training data.

So I split the 261 quarters of data into a training set of 156 quarters (1952 - 1991), a cross-validation set of 54 quarters (1991 - 2004), and a test set of 51 quarters (2005 - 2017). The cross-validation set would be used exclusively for choosing the best neural network architecture from among a few options.

Here are the models I tested on the CV data set, after training them on the 1952 - 1991 training set, along with their performance (measured in terms of MSE, or mean squared error).

Model   Layers   Neurons   Structure    CV Dataset Error (MSE)
A       1        5         5            0.0504
B       1        10        10           0.0389
C       2        9         3 x 3        0.0432
D       2        25        5 x 5        0.0305
E       3        27        3 x 3 x 3    0.0613
F       3        125       5 x 5 x 5    0.0629

Based on the above, Model D, with two hidden layers of 5 neurons each, performs best on the independent cross-validation data set.
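The comparison itself can be scripted along these lines - a rough sketch using the neuralnet package and the df data frame from the earlier sketch, with the scaling and split indices simplified:

    library(neuralnet)   # assumption: neuralnet used for the feed-forward models

    # Scale the data, then carve out the training and cross-validation sets
    df_s  <- as.data.frame(scale(df))
    train <- df_s[1:156, ]      # 1952 - 1991
    cv    <- df_s[157:210, ]    # 1991 - 2004

    form <- as.formula(paste("Y ~", paste(paste0("V", 1:20), collapse = " + ")))
    structures <- list(A = 5, B = 10, C = c(3, 3), D = c(5, 5),
                       E = c(3, 3, 3), F = c(5, 5, 5))

    # Train each candidate structure and score it on the cross-validation set
    cv_mse <- sapply(structures, function(h) {
      fit  <- neuralnet(form, data = train, hidden = h, linear.output = TRUE)
      pred <- compute(fit, cv[, paste0("V", 1:20)])$net.result
      mean((pred - cv$Y)^2)
    })
    cv_mse   # pick the structure with the lowest cross-validation MSE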

The performance measured above reflects each model's quarter-by-quarter predictive value, because each quarter's prediction uses the prior 20 quarters of actuals as inputs. At no point does the model feed its own predicted values back in as autoregressive terms. So the accuracy figures above tell us how the model can be expected to perform when, for a given quarter, we have the last 20 actual GDP changes and want an estimate of just this quarter's change.

Performance of Selected Model on Test Data Set (2005 - 2017)

The MSE of the quarter-by-quarter predictions produced by Model D for the 51 quarters in the test data set is 0.4478 - quite a bit higher than the error seen on the cross-validation set. Most of this error shows up as a number of exaggerated predictions of positive growth (see the graph below).
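For reference, computing the test-set MSE and the comparison plot looks roughly like this (a sketch continuing from the objects in the previous snippet; whether the chosen model is refit on the combined training and CV data before testing is glossed over here):

    test <- df_s[211:261, ]   # the 51 test quarters, 2005 - 2017

    fit_D  <- neuralnet(form, data = train, hidden = c(5, 5), linear.output = TRUE)
    pred_D <- as.vector(compute(fit_D, test[, paste0("V", 1:20)])$net.result)

    mean((pred_D - test$Y)^2)   # test-set MSE

    # Predicted (blue) versus actual (black), in scaled units
    plot(test$Y, type = "l", xlab = "Test quarter", ylab = "Scaled real GDP growth")
    lines(pred_D, col = "blue")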

Below is a graph of the 51 predicted quarters (blue) versus actuals (black). Note that these are scaled and normalized values for real GDP growth, not actual % growth.



Conclusion

In general, the shapes of the two time series over the test period do look similar. But there is a bothersome trend: often, the model does not predict a large upward or downward spike in GDP change until a quarter or two after it actually occurred. This is understandable: once a sharp drop or jump in GDP is one or two quarters in the past, it becomes one of the autoregressive terms in the neural network's inputs, and so it shows up in the subsequent predictions.

In other words, this is a lagged prediction of the time series; more often than not, large swings will not be predicted until it's too late. Naturally, that's a limitation of this sort of model. If you ask an expert what next quarter's GDP growth or decrease will be, based on what happened in the last 20 quarters, you might get an answer that could be considered a reasonable maximum likelihood estimate, or "expected value". If we have just entered a period of contraction (recession), he might say, "Oh, I expect next quarter's results to be rather disappointing." That certainly doesn't require a genius - and that's essentially what the model is doing for us.

ARIMA time series models include a moving average term, i.e. one that is autoregressive on past errors; that could possibly help this model make better predictions. Otherwise, we might try using more inputs - not just past quarters' actual values, but other economic indicators that could give the machine learning algorithm more insight into the factors that create the "perfect storm" leading to a financial crisis or an economic boom.

The neural network figures out patterns in the data that we can't piece together ourselves. But the algorithm is only as smart or as dumb as the quality of the information we feed it. 


