Time-Tested Data Analysis Tools, On the Go

An Introduction to Predictive Analytics Using Linear Regression Analysis

Predictive analytics combined with Big Data tools are allowing business executives to leverage the historical and real-time collection of large data sets to attempt to explain future business trends. E-commerce businesses are tracking customer behavior on their websites and using predictive analytics to deliver customized online offerings for their customers. Consumer credit card companies are using predictive analytics to understand their customers’ purchasing habits and offer tailored loyalty programs based on those decisions.

Today, software vendors abound and are offering sophisticated tools for data mining and predictive analysis. In this newsletter, we provide a tutorial that showcases a simple analytical tool a busy manager can readily leverage in Microsoft Excel. Linear regression analysis can be used to predict future values based on previously realized results.

A simple linear regression model will have two variables where one independent variable (X) is used to predict the direction or behavior of a dependent variable (Y) in the forecast (Y = a + bX). Below are straightforward examples of interesting two-variable regression models:

A. Impact of advertising spend (X) on sales activity (Y)

B. Impact of political canvassing (X) on voter turnout (Y)

C. Impact of rest from physical exertion (X) on athletic performance (Y)

D. Impact of corporate earnings (X) on stock price (Y)

For the purposes of this tutorial, we will cover the basics of regression modeling to explain the important aspects of the tool so that a manager can determine if the correlation that exists in the model is sufficient to rely on in order to use the model to predict future outcomes. The steps outlined below will guide you in using Excel’s regression tool to see if it is reliable for measuring the impact of JP Morgan’s corporate earnings performance on the company’s closing stock price on the day of earnings
release. We shall call beating or missing Wall Street’s earnings expectations ‘earnings surprise’ and make this our X-variable. The impact on the price of the company’s stock will be our Y-variable.

Using Linear Regression

1. Open the Excel file Regression Analysis-JPM Stock-Price Action.xlsx using Microsoft Excel (2007 or later), and go to the tab Regression Inputs. 

Load the Analysis ToolPak to access the Regression tool:


2. Click the File tab, Click Options, and then click the Add-Ins category.


3. In the Manage box, select Excel Addins and then click Go.


4. In the Add-Ins dialog box, select the Analysis ToolPak check box.


5. Click OK. Click Yes to install it.


6. Click the Data tab. Within the Analysis
ribbon, select the Data Analysis icon.


7. When the Data Analysis window
opens, select Regression and then
click OK.

The Regression window opens.


8. Still on the tab Regression Inputs, click in the box Input Y-Range (on the Regression window) and use your mouse to highlight cells B1:B21 on
your spreadsheet.


9. Click in the box Input X-Range and use your mouse to highlight cells A1:A21.


10. Check the boxes Labels, Residuals, and Line Fit Plots.


11. Click OK.

  • A Summary Output report is produced (see Regression Report tab).
  • Your model will follow the linear equation Y = a + bX + Error
  • The Coefficients column of the report holds the y-intercept (-0.146354677) and the slope of the line (0.932751283) values.
  • For the purposes of this tutorial, regression analysis is providing us with the equation of a line in the form of: Price Impact = -0.146 + 0.933X

  • The t Stat value provides an indication of the significance of the above coefficients. A t Stat of greater than 3 or less than -3 is very good, while less than 2 or greater than -2 would call into question your assumption that Y and X are correlated variables. For our model, we have a T-Stat of 0.81, which is NOT a strong indicator of the usefulness of this model.
  • The Adjusted R Square represents the variation that is explained in the model, with high numbers providing a better fit. With our model, we are showing only about -2%, not a strong indication of the reliability of this model for predicting stock price impact based on earnings surprises.
  • Low Standard Error or the aggregate standard deviation of the residuals (difference between actual Y data points versus the predicted Y values) is the goal and indicates a strong fit of the model to the actual data. For our model, we are showing 81.1%, so there is quite a bit of error between our predicted values and the actual data points.
  • From the above, one can conclude that our regression model is not going to be an accurate tool for predicting future stock price impact due to earnings surprises, neither when those surprises are the result of beating Wall Street’s expectations nor when missing analyst expectations.
  • This result would seem to indicate that other factors, such as a CEO’s future guidance, could hold significant sway over a stock’s price direction on the day of earnings release.

From the above graph it is clear that a linear regression model does not accurately predict the variations in our dataset.

Before concluding this tutorial we would like to highlight the fact that, in order for a model to be reliable, one must confirm that four additional assumptions hold true:

  1. Linearity: Is the relationship between the X and Y variables linear? The best way to validate this dependency is to plot the line and confirm that the scatterplot of the actual data points is linear around your line. If the scatterplot is not falling in a linear fashion, your model may not be a good tool for predicting future outcomes.
  2. Homoscedasticity: Confirm that the amount of variance around any Y value is constant. You are trying to confirm that the uncertainty or variance does NOT increase as X increases. The best approach for confirming this is to examine a plot of residuals versus fitted values.
  3. Normality: Confirm the normality of your residuals using the k / n + 1 method by forming a plot and looking for an S shape to the curve.
  4. Temporal Independence: Ensure that the residuals for each value of Y are not correlated. To test this assumption, plot the residuals in time series order and confirm that the residuals do not exhibit a consistent pattern above or below the zero (x-axis) line.