Prediction (out of sample)

In [1]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# design matrix: constant, x1, sin(x1), (x1 - 5)**2
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
# true coefficients and noisy observations
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     951.0
Date:                Fri, 26 Dec 2025   Prob (F-statistic):           2.23e-41
Time:                        15:13:15   Log-Likelihood:                 2.5755
No. Observations:                  50   AIC:                             2.849
Df Residuals:                      46   BIC:                             10.50
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0219      0.082     61.491      0.000       4.857       5.186
x1             0.5046      0.013     40.061      0.000       0.479       0.530
x2             0.4336      0.050      8.758      0.000       0.334       0.533
x3            -0.0213      0.001    -19.278      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        2.898   Durbin-Watson:                   1.726
Prob(Omnibus):                  0.235   Jarque-Bera (JB):                2.767
Skew:                           0.544   Prob(JB):                        0.251
Kurtosis:                       2.621   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
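
As a sanity check on what fit() computes, the estimated coefficients can be reproduced directly as the least-squares solution with numpy (a minimal sketch using the X and y defined above; assumes a numpy version that accepts rcond=None):

# least-squares coefficients computed directly; these should agree
# with olsres.params up to floating-point error
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_hat, olsres.params))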

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.48888172  4.95041693  5.37656968  5.74370702  6.03672503  6.25153031
  6.39571253  6.48729747  6.55178543  6.61796163  6.71316681  6.85880531
  7.06682835  7.33777047  7.66066173  8.01483004  8.37329751  8.70721285
  8.99059156  9.2045851   9.34057766  9.4016017   9.40183974  9.36429422
  9.31700823  9.28845814  9.30287501  9.37626347  9.51377226  9.70884822
  9.94431305 10.1951856  10.43278619 10.62944867 10.76306579 10.82072147
 10.80081277 10.71331333 10.57813469 10.42185487 10.27335171 10.1590583
 10.0986201  10.10166727 10.16623241 10.27907387 10.41785123 10.55479586
 10.66127315 10.71248714]
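
For the original data, predict(X) coincides with the fitted values already stored on the results object; a quick check using the objects defined above:

# in-sample predictions are the same as the fitted values
print(np.allclose(ypred, olsres.fittedvalues))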

Create a new sample of explanatory variables Xnew, predict and plot.

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.67603654 10.52309312 10.27066258  9.95749869  9.63461508  9.35279528
  9.15015915  9.04282984  9.02098624  9.05126755]
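
If uncertainty around the out-of-sample predictions is also needed, statsmodels 0.8+ provides get_prediction, which returns standard errors along with confidence and prediction intervals; a minimal sketch:

# point predictions plus 95% confidence and prediction intervals
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))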

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula expansion magic applied to **2.
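
One way to see exactly what the formula builds is to construct the design matrix with patsy (the formula engine statsmodels relies on) and compare it with the hand-built X from above; a small sketch:

import patsy

# columns generated from the formula: Intercept, x1, np.sin(x1), I((x1 - 5)**2)
dm = patsy.dmatrix("x1 + np.sin(x1) + I((x1-5)**2)", data)
print(np.allclose(np.asarray(dm), X))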

In [8]:
res.params
Out[8]:
Intercept           5.021855
x1                  0.504575
np.sin(x1)          0.433639
I((x1 - 5) ** 2)   -0.021319
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.676037
1    10.523093
2    10.270663
3     9.957499
4     9.634615
5     9.352795
6     9.150159
7     9.042830
8     9.020986
9     9.051268
dtype: float64
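
These values should match the ynewpred array computed earlier from the explicitly constructed Xnew (same model, same new data), which can be confirmed with:

# formula-based predictions agree with those from the hand-built Xnew
print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))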