Thursday, November 30, 2017

Obama, Nobama? Using PCA Algorithm for Dimensionality Reduction of Images

I learned about the PCA (principal components analysis) algorithm from Andrew Ng's Stanford Machine Learning course, and I had to give the algorithm a test run on my own to see it in action.

PCA itself is not actually a machine learning algorithm - it's an algorithm for simplifying an input data set with many dimensions so that you can then feed it to an ML algorithm without blowing up your computer. Or, more simply put, you can use PCA to narrow down the number of input features you have when you suspect that many of them might be highly correlated with each other.

Where it gets interesting is that you can also use PCA for such high-dimensional objects as JPEG images, which may have tens of thousands of pixels. If you are feeding each pixel one-by-one into a machine learning algorithm for purposes of image recognition, computer vision, or whatever it may be, the computation time is going to be huge.

PCA is a mathematical method to find the best description of correlations between variables in order to use fewer input features in place of more input features. For instance, if temperature (T) and snowfall (S) are highly correlated, why not use one input feature Z that represents both variables simultaneously (where Z is the line which represents the least projection error fit to all the points in the S, T plane) to reduce the algorithm's computation time?

So PCA essentially decomposes a complex, high-dimensional data set with N dimensions down to a more manageable data set of K dimensions (where K is a number of our own choosing).

Source: Pennsylvania State University - Department of Statistics Online Programs
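To make this concrete, here is a minimal sketch in R of the temperature/snowfall idea from above, using base R's prcomp(). The data are simulated purely for illustration, not taken from any real measurements:

    # Two strongly correlated variables (simulated for illustration)
    set.seed(42)
    temperature <- rnorm(100, mean = 30, sd = 10)
    snowfall    <- -0.8 * temperature + rnorm(100, sd = 3)

    pca <- prcomp(cbind(temperature, snowfall), center = TRUE, scale. = TRUE)
    summary(pca)       # proportion of variance explained by each principal component
    z <- pca$x[, 1]    # the single new feature Z: each point projected onto the first component

With variables this correlated, nearly all of the variance lands in the first component, so Z alone can stand in for both original inputs.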

Without getting into the details of how this is done - it involves linear algebra and some thorny mathematical derivations - here is what I did to test out PCA using images from the Internet:

  • I collected 100 JPEG images of President Obama from Google Images. Each image was 100 x 100 pixels, which meant 10,000 pixels per image in total. However, my laptop lacked sufficient computation power to do the calculations necessary for dimensionality reduction via PCA with that many inputs - so I used Paint to condense each image down to 50 x 50 size, or 2,500 pixels each. This is the full collection of Obama images I gathered for my input matrix:


  • Next, I used the pixmap library in R to convert each of the images into a matrix of numbers, i.e. 2,500 pixel intensity values. With 100 total images, my input feature matrix was 100 x 2,500 in size.
  • After I had my input matrix, I ran it through the PCA algorithm and kept the top 40 principal components, which retain a full 95% of the variance in the input features across the 100 sample images. It so happens that the amount of variance retained when the N-dimensional input feature matrix is compressed down to K dimensions follows a logarithmic-looking curve of diminishing returns, as shown below:

  • The graph above essentially means that the 50 x 50 images can be sufficiently represented by only 100 dimensions instead of 2,500 for purposes of numerically representing the pictures based on pixel intensity variation. In fact, just 40 dimensions capture 95% of the variance. Think about what this means for a second: if a certain pixel has an intensity X, several of the surrounding pixels are highly correlated in intensity with X. So dimensionality reduction lets us represent the whole image numerically with only 100 components instead of 2,500.
  • Now, I was interested in how much distortion the PCA approximation caused for each of the images. So I ran the reduction in reverse to approximate the original pixel values from the 40 retained components and rendered the reconstructed images (a minimal R sketch of this round trip follows after this list). This is similar to, say, splitting an average value Z into two equal parts in order to recover the original variables X and Y: there is going to be some distortion and loss of information. But I wanted to see visually how much.
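Here is what that compress-and-reconstruct round trip looks like as a minimal R sketch, assuming the 100 x 2,500 pixel-intensity matrix has already been assembled as an object called X (the variable names are mine, not the exact code I ran):

    # X: 100 x 2,500 matrix, one row per image, one column per pixel intensity (assumed already built)
    pca <- prcomp(X, center = TRUE)

    k <- 40                                    # number of principal components to keep
    Z <- pca$x[, 1:k]                          # 100 x 40 compressed representation

    # cumulative proportion of variance retained by the first k components
    cumsum(pca$sdev^2) / sum(pca$sdev^2)

    # linearly approximate the original 2,500 pixel values from the 40 components
    X_approx <- Z %*% t(pca$rotation[, 1:k])
    X_approx <- sweep(X_approx, 2, pca$center, "+")   # add the column means back in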
Here's an example of how one of the images looked before and after the PCA approximation - that is, compressing 2,500 dimensions down to 40, and then linearly approximating the original 2,500 again:

Original | Approximation (with 95% of variance retained)

                             

The loss of information makes an already blurry image (it's only 50 x 50 pixels, after all) even more unsightly, but guess what? For purposes of a machine learning algorithm, the dimension-reduced version will more likely than not suffice.

Monday, October 9, 2017

Predicting Recessions Using Neural Networks


Real GDP Growth from 1947 to present
(source: FRED, https://fred.stlouisfed.org)


How would a specialized machine learning algorithm perform at predicting this quarter's real GDP growth, given the last 20 quarters' real GDP growth/decrease percentages as inputs?

Actually, there are already techniques for modeling and predicting time series based on past values of that time series (called "auto-regressive" time series), like the ARIMA discussed in the previous post. ARIMA uses a linear combination of past values of a time series plus a linear combination of past errors of that time series (i.e. estimated minus actuals for the fitted portion) to predict future values.

But I wanted to try a machine learning approach - specifically neural networks, because of their flexibility and adaptability in a supervised learning setting.
What is a neural network? (High-level overview)

It's a machine learning algorithm that's inspired by the workings of the biological brain. You have a number of inputs, which you feed to the learning algorithm. You give the algorithm a set of training data, i.e. numerous observations of inputs alongside the dependent variable output, or "Y" value that you wish to predict. 

The inputs feed into hidden "neurons". You can choose as many or as few neurons as you wish, and you can also change the number of hidden neuron "layers". Each neuron is "activated" or not according to a logistic function, also called a sigmoid curve, which takes on values between 0 and 1 depending on the input fed to that neuron. Essentially, the neural network you build is a mesh of activated and unactivated neurons that together have the potential to form a complex logical calculus that ultimately leads to an output variable prediction. That is, you feed the neural network A, B, and C, for example. This activates neurons D and F, but not E or G. Neurons D and F then activate J and K, which in turn influence the prediction that is output from the neural network.

The algorithm "learns" by adapting the weights assigned to each of the connections between neurons. These weighted connections determine, for example, to what extent input A plays a role in activating neuron D. A different weight in the same matrix determines how neuron D, if activated, will influence the final output. So the "learning" process happens by the algorithm optimizing the weights on all neural connections in such a way that error (or "cost", to use the precise machine learning term) is minimized.

Here is a pictorial representation of an actual neural network implemented in the R interface. It has one input layer with 5 inputs, two hidden layers with 3 neurons each, and an output layer with one variable.
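For reference, a network with that shape can be set up in R with the neuralnet package roughly like this; the data frame and column names below are placeholders for illustration, not the model used later in this post:

    library(neuralnet)

    # toy data: 5 inputs and one continuous output -- placeholder data for illustration only
    set.seed(1)
    df <- as.data.frame(matrix(rnorm(500), ncol = 5))
    names(df) <- paste0("x", 1:5)
    df$y <- 0.5 * df$x1 - 0.3 * df$x2 + rnorm(100, sd = 0.1)

    # 5 inputs, two hidden layers of 3 neurons each, one output
    nn <- neuralnet(y ~ x1 + x2 + x3 + x4 + x5,
                    data = df,
                    hidden = c(3, 3),
                    linear.output = TRUE)   # regression rather than classification

    plot(nn)   # draws a network diagram like the one above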


Model Setup:

Now that all that introductory business is out of the way, here's how I actually set up the model for the experiment.

I first gathered the necessary time series data from FRED, i.e. the Federal Reserve Bank of St. Louis website, for Real GDP Growth by Quarter from 1947 Q2 to 2017 Q2. I set up a data frame in R containing one row for each of the quarters from Q2 of 1952 all the way to Q2 of 2017 (starting in 1952 Q2 because this is the first quarter with 20 full preceding quarters of data to be used as input variables). For each quarter's row, I included the lagged GDP growth values for the preceding 20 quarters - for example, in the row for Q2 2017, I included the following variables:

V1 = Q1 2017 growth/decrease
V2 = Q4 2016 growth/decrease
V3 = Q3 2016 growth/decrease
...
V20 = 2012 Q2 growth/decrease.

So the neural network would include 20 inputs for each training example - one value for each of the preceding 20 quarters of GDP growth or decrease. Then, of course, the dependent variable is the actual growth or decrease for that particular quarter.
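A compact way to build that lagged design matrix in R is base R's embed() function. Here gdp_growth stands for the quarterly growth series pulled from FRED (the object name is an assumption):

    # gdp_growth: numeric vector of quarterly real GDP growth, 1947 Q2 through 2017 Q2 (assumed loaded)
    lagged  <- embed(gdp_growth, 21)              # each row: current quarter followed by its 20 predecessors
    lags_df <- as.data.frame(lagged)
    names(lags_df) <- c("y", paste0("V", 1:20))   # y = current quarter, V1..V20 = lags 1 through 20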

Choosing Neural Network Structure by Cross-Validation

How many layers, and how many neurons in each layer, would let the algorithm do the best possible job of learning the pattern in the data and predicting the GDP growth or decrease in a given quarter? To be sure, more neurons and layers mean greater flexibility in fitting to the training data, and fewer neurons mean less flexibility and more bias. But too much flexibility means high variance - in other words, overfitting to the training data.

So I split the 261 quarters of data into a training set of 156 quarters (1952 - 1991), a cross-validation set of 54 quarters (1991 - 2004), and a testing set of 51 quarters (2005 - 2017). The cross-validation data set would be used exclusively for choosing the best neural network architecture among a few options.

Here are the models I tested on the CV data set, after training them on the data set from 1952 - 1991, and their performances (measured in terms of MSE, or mean-squared error).

Model   Layers   Neurons   Structure     CV Dataset Error (MSE)
A       1        5         5             0.0504
B       1        10        10            0.0389
C       2        9         3 x 3         0.0432
D       2        25        5 x 5         0.0305
E       3        27        3 x 3 x 3     0.0613
F       3        125       5 x 5 x 5     0.0629

From the above, Model D, with two hidden layers of 5 neurons each, appears to perform the best on an independent cross-validation data set.
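The architecture comparison itself can be sketched roughly as follows, assuming the lagged data frame above has been split by date into train_df and cv_df; the helper names are mine, not the exact code behind the table:

    library(neuralnet)

    fmla <- as.formula(paste("y ~", paste(paste0("V", 1:20), collapse = " + ")))
    architectures <- list(A = 5, B = 10, C = c(3, 3), D = c(5, 5),
                          E = c(3, 3, 3), F = c(5, 5, 5))

    cv_mse <- sapply(architectures, function(h) {
      nn   <- neuralnet(fmla, data = train_df, hidden = h, linear.output = TRUE)
      pred <- compute(nn, cv_df[, paste0("V", 1:20)])$net.result
      mean((cv_df$y - pred)^2)                   # mean-squared error on the CV quarters
    })
    cv_mse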

The performance measured above reflects each model's quarter-by-quarter predictive value, because each quarter uses the prior 20 quarters' actuals to predict the current quarter's GDP change. At no point does the model incorporate its own predicted values as autoregressive terms for predicting the current quarter's value. So the accuracy measure above tells us how we can expect the model to perform if, for a given quarter, we have the last 20 actuals for GDP change and want an estimate of just this quarter's GDP change.

Performance of Selected Model on Test Data Set (2005 - 2017)

The MSE of the quarter-by-quarter predictions produced by Model D for the 51 quarters in the test data set is 0.4478, quite a bit higher than the error seen on the cross-validation set. Most of this error takes the form of a number of exaggerated predictions of positive growth (see graph below).

Below is a graph of the 51 predicted quarters (blue) versus actuals (black). Note that these are scaled and normalized values for real GDP growth, not actual % growth.
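The test-set error and the comparison plot come from the same kind of computation, sketched here under the same naming assumptions (nn_d being Model D trained on the 1952 - 1991 data):

    pred_test <- compute(nn_d, test_df[, paste0("V", 1:20)])$net.result
    mean((test_df$y - pred_test)^2)    # test MSE

    plot(test_df$y, type = "l", col = "black",
         xlab = "Quarter (2005 - 2017)", ylab = "Scaled real GDP growth")
    lines(pred_test, col = "blue")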



Conclusion

In general, the shapes of the two time series over the test period do look similar. But there is a bothersome trend: often, the model does not predict a large upward or downward spike in GDP change until a quarter or two after it actually occurred. This is understandable, because once a sharp drop or jump in GDP is one or two quarters in the past, it is included as one of the autoregressive terms in the neural network prediction, and therefore it shows up in future predictions.

In general, this is a lagged time series prediction; more often than not, large swings will not be predicted until it's too late. And naturally, that's a limitation of this sort of model. If you ask an expert what next quarter's GDP growth or decrease will be, based on what happened in the last 20 quarters, you might get an answer that could be considered a reasonable maximum likelihood estimate, or "expected value". If we have already just entered a period of contraction (recession), he might say, "Oh, I expect next quarter's results to be rather disappointing." That certainly didn't require a genius. And that's essentially what the model is doing for us.

ARIMA time series models include a moving average term, i.e. one that is autoregressive on past errors. That could possibly help this model make better predictions for the current quarter. Otherwise, we might try using more inputs - not just past quarters' actual values, but other economic variables that could give the learning algorithm more insight into what factors might cause the "perfect storm" that leads to a financial crisis or an economic boom.

The neural network figures out patterns in the data that we can't piece together ourselves. But the algorithm is only as smart or as dumb as the quality of the information we feed it. 



Friday, December 2, 2016

Predicting Trump's First Term Approval Rating

If we had to model a president's approval rating, what explanatory (independent) variables might we choose for our model?

The investigation detailed in the previous post taught us at least two things: 
  1. A sitting president's approval rating can be highly correlated with the national unemployment rate over the course of his/her term.
  2. There seems to be a link between how long a sitting president has been in office and his/her approval rating. Often, approval rating trends downward as time progresses. Can we say, familiarity breeds contempt?
I decided to take these two variables (unemployment rate and time a president has been in office) and do a crude projection to see how the next four years might fare for the new president-elect. I did this using OLS (ordinary least squares), i.e. a straight-up, simple linear regression of approval rating on these two variables. 

However, I used an ARIMA process to model and project the path of unemployment over the next four years. I took advantage of the Auto ARIMA feature in R, which automatically chooses an ARIMA model to best fit the given data.

(A fuller introduction to what an ARIMA process is can be found here: https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average )

Basically, ARIMA is a model you can use to forecast future values of a time series (e.g. stock prices, rainfall, frequency of crimes) that is autoregressive, i.e. depends on its own past values, and that contains a random, stochastic component called "white noise". The "integrated" part of ARIMA means the series is differenced as needed to make it stationary before the autoregressive and moving-average terms are fit.

This is an outline of the steps I followed to come up with predicted values for the sitting president's approval rating from 2016 Q4 through 2020 Q4 (this period of course includes President Obama's final quarter, whose approval rating is still unknown, so I grouped it into the forecasting period):


1) Use Auto ARIMA feature in R to pick a time series model and forecast unemployment rate for the next four years (2016Q4 through 2020Q4).

The model chosen by R is an ARIMA(1, 1, 2) model, or a damped trend linear exponential smoothing model, that has the following AR and MA terms, along with standard error values:
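In R this step boils down to a couple of calls from the forecast package, assuming the quarterly unemployment series has been loaded as a ts object called unemp (the name is an assumption):

    library(forecast)

    # unemp: quarterly civilian unemployment rate as a ts object (assumed already pulled from FRED)
    fit <- auto.arima(unemp)       # selects the ARIMA order automatically; here it chose ARIMA(1, 1, 2)
    summary(fit)                   # AR and MA coefficients with their standard errors

    fc <- forecast(fit, h = 17)    # 2016 Q4 through 2020 Q4
    plot(fc)                       # mean forecast plus 80%/95% confidence bands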


Using this model, the forecasted unemployment rates by quarter for the next 17 quarters, i.e. from 2016 Q4 to 2020 Q4, came out as:


                        Forecasted Unemployment Rate (2017-2020)

Could the unemployment rate really change so little over the next four years? Maybe, but this is the mean predicted value from the ARIMA model. The confidence interval itself is much larger - it extends from below 0% (which we know isn't possible) to above 10%. For simplicity, I am using this mean predicted value in the final forecast of approval rating. 

In a way, it makes sense that unemployment would hit that asymptotic line around 5%, because that's close to the frictional, or "minimum" unemployment rate seen historically.

For further illustration, here is the unemployment rate "forecast" with massive 95% confidence intervals in light shaded blue. The mean predicted value I am using is the dark blue line:


2) Fit a linear (OLS) regression model to the data for the dependent variable Approval Rating in terms of the independent variables (Unemployment Rate, Quarters as President).

I created the time variable Quarters as President (QAP) to simply denote how many quarters the sitting president will have been in office as of that quarter during the forecast period (assuming he remains in office for his full term). The QAP for Obama's last term in 2016Q4 is 32.


The statistics summary for the linear regression model Approval Rating ~ Unemployment + QAP indicates that while both independent variables are significant (i.e. p-value < 0.05), the R-squared value is only 0.1932, which means that approval rating clearly depends on factors other than the two we are considering here. Nonetheless, this is an exercise in crude and simple estimation, so I pushed forward.
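The fit itself is a one-liner in R. The data frame and column names below (approval_df with Approval, Unemployment, and QAP columns) are my assumptions about the layout, not the exact objects used:

    # approval_df: one row per quarter with columns Approval, Unemployment, QAP (layout assumed)
    ols_fit <- lm(Approval ~ Unemployment + QAP, data = approval_df)
    summary(ols_fit)    # coefficients, p-values, and the R-squared of 0.1932 quoted above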

Linear model summary:



3) Use the linear model from step # 2 with the forecasted unemployment rate in step # 1 to generate predicted mean values and a confidence interval for Approval Rating in the forecast period (2016 Q4 -2020 Q4).

This is the resulting output from the linear regression model in step 2 after plugging in the forecasted unemployment data and QAP ("fit" column is the actual prediction, "lwr" and "upr" are the default 95% confidence intervals):
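Generating that table is just a predict() call on the fitted model, roughly as below, with future_df holding the forecasted unemployment rate and QAP for each quarter from 2016 Q4 through 2020 Q4 (names assumed):

    # future_df: columns Unemployment (ARIMA mean forecast) and QAP for each forecast quarter (assumed)
    pred <- predict(ols_fit, newdata = future_df, interval = "confidence", level = 0.95)
    head(pred)    # columns: fit (the prediction), lwr and upr (95% interval bounds)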


Shown graphically with upper and lower 95% default confidence intervals, and appended to the actual historical presidential approval rating data from  2010 onwards:

          Forecasted Approval Rating (2017-2020)
Conclusion and Interpretations:

In conclusion, a couple observations:
  • The QAP (quarters as president) variable is responsible for predicting the initial spike in approval rating that comes at the beginning of a new president's term, as the country anticipates the change to a fresh new administration, often with a certain degree of optimism and hopefulness. But because Donald Trump is known to not be terribly popular as an incoming president, it's unlikely that this would be a very pronounced spike.
  • The current unemployment rate (about 5%) is not likely to get much lower, since this is near the lower boundary due to frictional unemployment. Also, the QAP variable captures the fact that approval rating tends to go down over time, rather than up, as the nation grows weary of a president. In other words, after the initial "spike" in approval, there's not much room for Trump's approval rating to increase significantly - at least not on the basis of unemployment rate. However, as I noted earlier, the R-squared value for this model is only 0.1932, so there are many other potential explanatory variables that could determine how the new president's popularity plays out.

Data sources:

1. The American Presidency Project, UCSB (http://www.presidency.ucsb.edu/)
2. FRED, Civilian Unemployment Rate (https://fred.stlouisfed.org/)

Sunday, November 27, 2016

The Link Between Unemployment and Presidential Approval Rating

The last post attempted to answer the question, does it matter who the president is for the purpose of determining national economic performance (growth)? The answer was, at least statistically speaking, no, probably not. 

But does the president get the blame for a poorly performing economy (and high regard for a strongly performing one)? 

In this investigation, I gathered and aggregated data on the national unemployment rate and presidential approval rating by quarter, from 1977 Q1 to 2016 Q2. This time, I selected the unemployment rate as the economic performance metric because the availability of jobs and hiring conditions is probably the factor most directly related to how the average Joe feels about the state of the economy (more so than inflation, GDP, etc.).

The below graphs are scatter plots for each of the six presidencies from Carter to Obama of average quarterly presidential approval rating against average quarterly civilian unemployment rate. Each dot on the graph represents one quarter during that president's term.










 Conclusions and Interpretations:

When the economy was the main theme of a particular presidency, the poll numbers showed that Americans' satisfaction with the sitting president was strongly correlated with how the economy was performing.

There are three presidencies (Reagan, Bush Sr, Clinton) where the unemployment rate was highly negatively correlated with approval rating - that is, the president's approval rating was higher when unemployment was low, and vice versa. Logically, this is what one might expect. These are strong correlations, as seen from the p-values on the right-hand side of the above table, which give the probability of observing a correlation at least this strong if the null hypothesis were true (i.e. if there were actually no correlation between the two variables whatsoever). All three of these presidencies have p-values under 0.05, indicating a statistically significant correlation between approval rating and unemployment at the 5% significance level, and all of the correlations are negative, as shown by the correlation coefficients.
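The correlation coefficients and p-values behind that table can be reproduced in R with cor.test(); reagan_df here is a hypothetical stand-in for one presidency's quarterly data:

    # reagan_df: quarterly Approval and Unemployment for a single presidency (data frame name assumed)
    cor.test(reagan_df$Approval, reagan_df$Unemployment)
    # returns the Pearson correlation coefficient and the p-value for the null hypothesis of zero correlation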

Interestingly, the other three presidencies (Carter, George W. Bush, Obama) showed slightly positive correlations between the two variables. However, one should not conclude that these presidents benefited in terms of popularity from a sinking economy, because these are all weak correlations, and none of them shows any conclusive link between the two variables at the 5% significance level. What this suggests is that factors other than unemployment or economic performance played a more significant role in determining these presidents' approval ratings.

Here's a brief look at the presidencies on a case-by-case basis:

1) Jimmy Carter (weak positive/no correlation): He came into office with a high approval rating (almost 70%) even though unemployment was about 7% at the time. This has a lot to do with Republicans falling out of favor following Watergate and Richard Nixon's resignation. Carter's approval did surely decline as the economy worsened and inflation skyrocketed. But this effect is not fully apparent from the above data because of his initially high rating and the fact that his presidency lasted only four years.

2) Ronald Reagan (strong negative correlation): The 1980's began with a recession, which was followed by a long period of expansion and prosperity. Reagan left office with a high approval rating, no doubt in large part because of the country's economic growth under his presidency.

3) George H.W. Bush (strong negative correlation): A souring economy in the early 1990's ended up being the dominant factor in driving Bush Sr's approval ratings down, and a key reason for his defeat in the 1992 presidential election.

4) Bill Clinton (strong negative correlation): "The economy, stupid" was the focus of Clinton's presidential campaign in 1992. The strongest correlation of the six presidencies by far (-0.762), Clinton's presidency coincided with a long boom in economic growth and technological innovation. Generally speaking, Americans responded to these good economic times with high regard for their leader, even amidst the clamor of an impeachment. 

5) George W. Bush (weak positive/no correlation): Never mind how the economy was doing - the country was widely upset over the state of the wars in Iraq and Afghanistan, and this drowned out the "economy" correlation effect. Of course the 2008 financial crisis didn't help Bush's ratings, but that was already near the end of his term.

6) Barack Obama (weak positive/no correlation): He came in highly popular and with great anticipation in the midst of the economic disaster. Unemployment decreased by half during his time in office. However, it seems that Americans did not give Obama proportionately greater props, as gripes over the president's approach to healthcare reform, partisan deadlock in Congress, and handling of other foreign and domestic issues prevented his approval rating from breaking 50% for most of his two terms. 



Data sources:
1. Presidential Approval Rating: The American Presidency Project, University of California, Santa Barbara (http://www.presidency.ucsb.edu/)
2. Unemployment by quarter: Federal Reserve Bank of St. Louis, Civilian Unemployment Rate 
(https://fred.stlouisfed.org/)


Monday, November 21, 2016

No Evidence that Economic Growth Differs Based on Who is President

If you divide the last 40 years into the six respective presidencies since 1977 (Carter, Reagan, Bush Sr, Clinton, W. Bush, Obama) and treat Real GDP Growth as a normally distributed random variable which produced 40 independent observations, is there a statistically significant difference in the mean GDP growth parameter (μ) between the six presidencies?

Below is the list of years analyzed, from 1977 to 2015 (actually only 39 observations, not 40, because the full year GDP growth for 2016 is not yet determined).

The one-way ANOVA test for equality of means essentially compares inter-group variation to intra-group variation to determine whether the difference in group means is statistically significant.

The assumption that economic growth over the past 39 years is approximately normally distributed appears reasonable, as seen from the histogram of GDP growth.
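In R, the test amounts to a single aov() call, assuming the table below has been loaded as a data frame gdp_df with columns President and Growth (the names are assumptions):

    # gdp_df: 39 rows, one per year, with a President factor and a numeric Growth column, as in the table below
    fit <- aov(Growth ~ President, data = gdp_df)
    summary(fit)           # F statistic and p-value for equality of mean growth across presidencies

    hist(gdp_df$Growth)    # rough visual check of the approximate-normality assumption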

GDP Growth by Year and President
(Source: US Bureau of Economic Analysis)

Obs   President   Party        Year   Real Annual GDP Growth (%)
1     Carter      Democrat     1977   4.90
2     Carter      Democrat     1978   6.68
3     Carter      Democrat     1979   1.30
4     Carter      Democrat     1980   0.00
5     Reagan      Republican   1981   1.29
6     Reagan      Republican   1982   -1.40
7     Reagan      Republican   1983   7.83
8     Reagan      Republican   1984   5.63
9     Reagan      Republican   1985   4.28
10    Reagan      Republican   1986   2.94
11    Reagan      Republican   1987   4.45
12    Reagan      Republican   1988   3.84
13    Bush        Republican   1989   2.78
14    Bush        Republican   1990   0.65
15    Bush        Republican   1991   1.22
16    Bush        Republican   1992   4.33
17    Clinton     Democrat     1993   2.63
18    Clinton     Democrat     1994   4.13
19    Clinton     Democrat     1995   2.28
20    Clinton     Democrat     1996   3.80
21    Clinton     Democrat     1997   4.45
22    Clinton     Democrat     1998   5.00
23    Clinton     Democrat     1999   4.69
24    Clinton     Democrat     2000   2.89
25    W. Bush     Republican   2001   0.21
26    W. Bush     Republican   2002   2.04
27    W. Bush     Republican   2003   4.36
28    W. Bush     Republican   2004   3.12
29    W. Bush     Republican   2005   3.03
30    W. Bush     Republican   2006   2.39
31    W. Bush     Republican   2007   1.87
32    W. Bush     Republican   2008   -2.70
33    Obama       Democrat     2009   -0.20
34    Obama       Democrat     2010   2.73
35    Obama       Democrat     2011   1.68
36    Obama       Democrat     2012   1.28
37    Obama       Democrat     2013   2.66
38    Obama       Democrat     2014   2.49
39    Obama       Democrat     2015   1.88













Conclusion: 


The F-Value from the ANOVA test indicates that there is not enough evidence (at the 5% significance level) to conclude that the different presidencies had inherently different mean real GDP growth parameters.

In other words, we cannot conclude that the observed yearly historical differences in economic growth level by presidency are anything other than the expected level of variation caused by other factors completely separate from the identity of the sitting president.

Why this is true, in my perspective:

1) The Federal Reserve has a more direct and immediate role than the President of the United States in influencing GDP growth through monetary policy. For example, by sharply increasing interest rates, the Fed can curb economic growth and even cause a recession (as happened in the early 1980's).

2) Policies of the current presidential administration may not have an impact until many years (and many presidencies) later. Just as investments in infrastructure, education, and the like will boost economic growth somewhere down the road, bad economic policies can cause financial disasters many years after they are implemented.

3) The business cycle is inherent to a capitalist economy, and periods of growth and contraction are influenced by a wide variety of factors that are out of the president's immediate control - these include consumer spending and confidence, business confidence and uncertainty, world political events, oil shocks, etc.

What the one-way ANOVA test cleverly captures is that GDP growth has on average varied enough within presidencies, no doubt in large part due to the natural boom-and-bust cycles, that we cannot reliably conclude that variance between presidencies is significant.






