Chris Malec

Data Scientist

Regression in Python


This is a very quick run-through of some basic statistical concepts, adapted from Lab 4 in Harvard’s CS109 course. Please feel free to try the original lab if you’re feeling ambitious :-) The CS109 git repository also has the solutions if you’re stuck.

Linear regression is used to model and predict continuous outcomes, assuming normally distributed random errors. There are many other types of regression models, and each is typically defined by the distribution of its prediction errors (called “residuals”) and the type of data it handles: logistic regression is used to model binary outcomes, whereas Poisson regression is used to predict counts. In this exercise, we’ll see some examples of linear regression as well as train-test splits.

The packages we’ll cover are: statsmodels, seaborn, and scikit-learn. While we don’t explicitly teach statsmodels and seaborn in the Springboard workshop, those are great libraries to know.

***

# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn

import seaborn as sns

# special matplotlib argument for improved plots
from matplotlib import rcParams
sns.set_style("whitegrid")
sns.set_context("poster")


Part 1: Introduction to Linear Regression

Purpose of linear regression


Given a dataset containing predictor variables $X$ and outcome/response variable $Y$, linear regression can be used to:

  • Build a predictive model that produces predictions $\hat{Y}$ for new data $X^*$ where $Y$ is unknown.
  • Model the strength of the relationship between each independent variable $X_i$ and $Y$
    • Many times, only a subset of independent variables $X_i$ will have a linear relationship with $Y$
    • Need to figure out which $X_i$ contributes most information to predict $Y$
  • In many cases, it is the first-pass prediction algorithm for continuous outcomes.

A Brief Mathematical Recap


Linear regression is a method to model the relationship between a set of independent variables $X$ (also known as explanatory variables, features, or predictors) and a dependent variable $Y$. This method assumes that each predictor in $X$ is linearly related to the dependent variable $Y$. The most basic linear regression model contains one independent variable $X$; we’ll call this the simple model.

\[Y = \beta_0 + \beta_1 X + \epsilon\]

where $\epsilon$ is an unobservable random variable that adds noise to the linear relationship. In linear regression, $\epsilon$ is assumed to be normally distributed with a mean of 0. In other words, on average a roughly equal number of predictions $\hat{Y}$ will fall above $Y$ and below $Y$; that is, on average, the error is zero. The residuals, $\epsilon$, are also assumed to be “i.i.d.”: independently and identically distributed. Independence means that the residuals are not correlated: the residual from one prediction has no effect on the residual from another prediction. Correlated errors are common in time series analysis and spatial analyses. While the model is written for individual observations, the regression function itself models the conditional mean of $Y$ at each value of $X$:

\[y = f(x) = E(Y | X = x)\]

(Figure: the regression line as the conditional mean of $Y$ given $X$. Image from http://www.learner.org/courses/againstallodds/about/glossary.html. Note this image uses $\alpha$ and $\beta$ instead of $\beta_0$ and $\beta_1$.)

Fitting the model produces estimates of the coefficients,

\[\hat{\beta}_0, \hat{\beta}_1\]

which give predictions

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1\]

With $p$ predictors, the multiple linear regression model is

\[Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon\]
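To make these assumptions concrete, here is a small simulated example (the intercept, slope, and noise scale are arbitrary illustrative choices, not estimates from any data set):

```python
# Simulate the simple model Y = beta_0 + beta_1 * X + eps with i.i.d. normal errors
rng = np.random.RandomState(0)
x = rng.uniform(0, 10, size=100)
eps = rng.normal(loc=0.0, scale=2.0, size=100)   # mean-zero, constant-variance noise
y = 1.5 + 0.8 * x + eps

plt.scatter(x, y, alpha=0.5)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Data simulated from a simple linear model")
plt.show()
```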

Estimating $\hat\beta$: Least squares


Least squares is a method that can estimate the coefficients of a linear model by minimizing the squared residuals:

\[\mathscr{L} = \sum_{i=1}^N \epsilon_i^2 = \sum_{i=1}^N \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^N \left(y_i - \left(\beta_0 + \beta_1 x_i\right)\right)^2\]

where $N$ is the number of observations and $\epsilon$ represents a residual or error, ACTUAL - PREDICTED.

Estimating the intercept $\hat{\beta_0}$ for the simple linear model

We want to minimize the squared residuals and solve for $\hat{\beta_0}$ so we take the partial derivative of $\mathscr{L}$ with respect to $\hat{\beta_0}$

\[
\begin{align}
\frac{\partial \mathscr{L}}{\partial \hat{\beta}_0} &= \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^N \epsilon_i^2 \\
&= \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^N \left( y_i - \hat{y}_i \right)^2 \\
&= \frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right)^2 \\
&= -2 \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right) \hspace{25mm} \mbox{(by chain rule)} \\
&= -2 \sum_{i=1}^N \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) \\
&= -2 \left[ \left( \sum_{i=1}^N y_i \right) - N \hat{\beta}_0 - \hat{\beta}_1 \left( \sum_{i=1}^N x_i \right) \right]
\end{align}
\]

Setting this derivative equal to zero and solving for $\hat{\beta}_0$:

\[
\begin{align}
2 \left[ N \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^N x_i - \sum_{i=1}^N y_i \right] &= 0 \\
N \hat{\beta}_0 &= \sum_{i=1}^N y_i - \hat{\beta}_1 \sum_{i=1}^N x_i \\
\hat{\beta}_0 &= \frac{\sum_{i=1}^N y_i}{N} - \hat{\beta}_1 \frac{\sum_{i=1}^N x_i}{N} \\
\boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}}
\end{align}
\]

Using this new information, we can compute the estimate for $\hat{\beta}_1$ by taking the partial derivative of $\mathscr{L}$ with respect to $\hat{\beta}_1$.

\[
\begin{align}
\frac{\partial \mathscr{L}}{\partial \hat{\beta}_1} &= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^N \epsilon_i^2 \\
&= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^N \left( y_i - \hat{y}_i \right)^2 \\
&= \frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right)^2 \\
&= 2 \sum_{i=1}^N \left( y_i - \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) \right) \left( -x_i \right) \hspace{25mm}\mbox{(by chain rule)} \\
&= -2 \sum_{i=1}^N x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) \\
&= -2 \sum_{i=1}^N \left( x_i y_i - \hat{\beta}_0 x_i - \hat{\beta}_1 x_i^2 \right) \\
&= -2 \sum_{i=1}^N \left( x_i y_i - \left( \bar{y} - \hat{\beta}_1 \bar{x} \right) x_i - \hat{\beta}_1 x_i^2 \right) \\
&= -2 \sum_{i=1}^N \left( x_i y_i - \bar{y} x_i + \hat{\beta}_1 \bar{x} x_i - \hat{\beta}_1 x_i^2 \right) \\
&= -2 \left[ \sum_{i=1}^N x_i y_i - \bar{y} \sum_{i=1}^N x_i + \hat{\beta}_1 \bar{x} \sum_{i=1}^N x_i - \hat{\beta}_1 \sum_{i=1}^N x_i^2 \right] \\
&= -2 \left[ \hat{\beta}_1 \left\{ \bar{x} \sum_{i=1}^N x_i - \sum_{i=1}^N x_i^2 \right\} + \left\{ \sum_{i=1}^N x_i y_i - \bar{y} \sum_{i=1}^N x_i \right\} \right]
\end{align}
\]

Setting this equal to zero and solving for $\hat{\beta}_1$ (using $\sum_{i=1}^N x_i = N \bar{x}$ in the last step):

\[
\begin{align}
2 \left[ \hat{\beta}_1 \left\{ \sum_{i=1}^N x_i^2 - \bar{x} \sum_{i=1}^N x_i \right\} + \left\{ \bar{y} \sum_{i=1}^N x_i - \sum_{i=1}^N x_i y_i \right\} \right] &= 0 \\
\hat{\beta}_1 &= \frac{-\left( \bar{y} \sum_{i=1}^N x_i - \sum_{i=1}^N x_i y_i \right)}{\sum_{i=1}^N x_i^2 - \bar{x} \sum_{i=1}^N x_i} \\
&= \frac{\sum_{i=1}^N x_i y_i - \bar{y} \sum_{i=1}^N x_i}{\sum_{i=1}^N x_i^2 - \bar{x} \sum_{i=1}^N x_i} \\
\boxed{\hat{\beta}_1 = \frac{\sum_{i=1}^N x_i y_i - N \bar{x}\bar{y}}{\sum_{i=1}^N x_i^2 - N \bar{x}^2}}
\end{align}
\]
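As a sanity check on the boxed formulas, here is a short NumPy sketch (not part of the original lab) that computes $\hat{\beta}_0$ and $\hat{\beta}_1$ directly; `np.polyfit(x, y, 1)` should return the same slope and intercept (in that order).

```python
# Closed-form least-squares estimates for the simple model
def simple_ols(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Slope from the boxed formula, then intercept from beta_0 = ybar - beta_1 * xbar
    beta1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
    beta0 = y.mean() - beta1 * x.mean()
    return beta0, beta1
```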

The solution can be written in compact matrix notation as

\[\hat\beta = (X^T X)^{-1}X^T Y\]

We wanted to show you this in case you remember linear algebra. For this solution to exist, we need $X^T X$ to be invertible, which requires a few extra assumptions: $X$ must be of full rank, meaning its rows and columns are linearly independent. Loosely, this corresponds to the observations and the variables, respectively, being independent. This is important for us because it means that having redundant features in our regression models will lead to poorly fitting (and unstable) models. We’ll see an implementation of this in the extra linear regression example.
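For completeness, here is a minimal NumPy sketch of the matrix solution; it uses `np.linalg.solve` rather than forming the explicit inverse, which is the numerically preferred way to solve the normal equations.

```python
# Solve beta_hat = (X^T X)^{-1} X^T Y via the normal equations
def ols_normal_equations(X, y):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(X)), X])   # prepend a column of ones for the intercept
    return np.linalg.solve(X.T @ X, X.T @ y)    # first entry is the intercept, rest are slopes
```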


Part 2: Exploratory Data Analysis for Linear Relationships

The Boston Housing data set contains information about the housing values in suburbs of Boston. This dataset was originally taken from the StatLib library which is maintained at Carnegie Mellon University and is now available on the UCI Machine Learning Repository.

Load the Boston Housing data set from sklearn


This data set is available in the sklearn python module which is how we will access it today.

from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
boston.data.shape
(506, 13)
# Print column names
print(boston.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
# Print description of Boston housing data set
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Now let’s explore the data set itself.

bos = pd.DataFrame(boston.data)
bos.head()
0 1 2 3 4 5 6 7 8 9 10 11 12
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

There are no column names in the DataFrame. Let’s add those.

bos.columns = boston.feature_names
bos.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Now we have a pandas DataFrame called bos containing all the data we want to use to predict Boston Housing prices. Let’s create a variable called PRICE which will contain the prices. This information is contained in the target data.

print(boston.target.shape)
(506,)
bos['PRICE'] = boston.target
bos.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2

EDA and Summary Statistics


Let’s explore this data set. First we use describe() to get basic summary statistics for each of the columns.

bos.describe()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

Scatterplots


Let’s look at some scatter plots for three variables: ‘CRIM’ (per capita crime rate), ‘RM’ (number of rooms) and ‘PTRATIO’ (pupil-to-teacher ratio in schools).

plt.scatter(bos.CRIM, bos.PRICE)
plt.xlabel("Per capita crime rate by town (CRIM)")
plt.ylabel("Housing Price")
plt.title("Relationship between CRIM and Price")
Text(0.5, 1.0, 'Relationship between CRIM and Price')

png

Part 2 Checkup Exercise Set I

Exercise: What kind of relationship do you see? e.g. positive, negative? linear? non-linear? Is there anything else strange or interesting about the data? What about outliers?

Exercise: Create scatter plots between *RM* and *PRICE*, and *PTRATIO* and *PRICE*. Label your axes appropriately using human readable labels. Tell a story about what you see.

Exercise: What are some other numeric variables of interest? Why do you think they are interesting? Plot scatterplots with these variables and *PRICE* (house price) and tell a story about what you see.

your turn: describe relationship

The relationship is not well described by a single line. Housing price seems largely independent of crime rate when crime rates are low, but at higher crime rates a pattern of lower housing prices emerges, and that portion of the relationship may be roughly linear.

# your turn: scatter plot between *RM* and *PRICE*
plt.scatter(bos.RM, bos.PRICE)
plt.xlabel("Average Number of Rooms per Dwelling (RM)")
plt.ylabel("Housing Price")
plt.title("Relationship between RM and Price")
Text(0.5, 1.0, 'Relationship between RM and Price')

png

# your turn: scatter plot between *PTRATIO* and *PRICE*
plt.scatter(bos.PTRATIO, bos.PRICE)
plt.xlabel("Pupil-Teacher Ratio by Town (PTRATIO)")
plt.ylabel("Housing Price")
plt.title("Relationship between PTRATIO and Price")
Text(0.5, 1.0, 'Relationship between PTRATIO and Price')

png

# your turn: create some other scatter plots

plt.scatter(bos.LSTAT, bos.PRICE)
plt.xlabel("% lower status of the population (LSTAT)")
plt.ylabel("Housing Price")
plt.title("Relationship between LSTAT and Price")
Text(0.5, 1.0, 'Relationship between LSTAT and Price')

png

plt.scatter(bos.NOX, bos.PRICE)
plt.xlabel("Nitric Oxide Concentration - parts per million (NOX)")
plt.ylabel("Housing Price")
plt.title("Relationship between NOX and Price")
Text(0.5, 1.0, 'Relationship between NOX and Price')

png

Scatterplots using Seaborn


Seaborn is a cool Python plotting library built on top of matplotlib. It provides convenient syntax and shortcuts for many common types of plots, along with better-looking defaults.

We can also use seaborn regplot for the scatterplot above. This provides automatic linear regression fits (useful for data exploration later on). Here’s one example below.

sns.regplot(y="PRICE", x="RM", data=bos, fit_reg = True)
/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval





<matplotlib.axes._subplots.AxesSubplot at 0x1a17780550>

png

Histograms


plt.hist(np.log(bos.CRIM))
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()

png

Part 2 Checkup Exercise Set II

Exercise: In the above histogram, we took the logarithm of the crime rate per capita. Repeat this histogram without taking the log. What was the purpose of taking the log? What do we gain by making this transformation? What do you now notice about this variable that is not obvious without making the transformation?

Exercise: Plot the histogram for *RM* and *PTRATIO* against each other, along with the two variables you picked in the previous section. We are looking for correlations in predictors here.

```python
# your turn
plt.hist(bos.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequency")
plt.show()
```

![png](output_36_0.png)

The purpose of the logarithm is to make the counts of low and high crime rates per capita visually comparable. Without the logarithm, the counts at low crime rates per capita dominate the histogram, and the long right tail is hard to see.

```python
# your turn
g = sns.jointplot(x=bos.RM, y=bos.PTRATIO)
g = g.set_axis_labels("RM", "PTRATIO")
```

![png](output_38_0.png)

```python
g = sns.jointplot(x=bos.LSTAT, y=bos.NOX)
g = g.set_axis_labels("LSTAT", "NOX")
```

![png](output_39_0.png)

## Part 3: Linear Regression with Boston Housing Data Example

***

Here, $Y$ = Boston housing prices (called "target" data in Python, and referred to as the dependent variable or response variable) and $X$ = all the other features (or independent variables, predictors, or explanatory variables), which we will use to fit a linear regression model and predict Boston housing prices. We will use the least-squares method to estimate the coefficients.

We'll use two ways of fitting a linear regression. We recommend the first, but the second is also powerful in its features.

### Fitting Linear Regression using `statsmodels`

***

[Statsmodels](http://statsmodels.sourceforge.net/) is a great Python library for a lot of basic and inferential statistics. It also provides basic regression functions using an R-like syntax, so it's commonly used by statisticians. While we don't cover statsmodels officially in the Data Science Intensive workshop, it's a good library to have in your toolbox. Here's a quick example of what you could do with it. The version of least squares we will use in statsmodels is called *ordinary least squares (OLS)*. There are many other versions of least squares, such as [partial least squares (PLS)](https://en.wikipedia.org/wiki/Partial_least_squares_regression) and [weighted least squares (WLS)](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares).

```python
# Import regression modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
```

```python
# statsmodels works nicely with pandas dataframes
# The thing inside the "quotes" is called a formula, a bit on that below
m = ols('PRICE ~ RM', bos).fit()
print(m.summary())
```

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  PRICE   R-squared:                       0.484
    Model:                            OLS   Adj. R-squared:                  0.483
    Method:                 Least Squares   F-statistic:                     471.8
    Date:                Sat, 25 May 2019   Prob (F-statistic):           2.49e-74
    Time:                        10:34:46   Log-Likelihood:                -1673.1
    No. Observations:                 506   AIC:                             3350.
    Df Residuals:                     504   BIC:                             3359.
    Df Model:                           1
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept    -34.6706      2.650    -13.084      0.000     -39.877     -29.465
    RM             9.1021      0.419     21.722      0.000       8.279       9.925
    ==============================================================================
    Omnibus:                      102.585   Durbin-Watson:                   0.684
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              612.449
    Skew:                           0.726   Prob(JB):                    1.02e-133
    Kurtosis:                       8.190   Cond. No.                         58.4
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

#### Interpreting coefficients

There is a ton of information in this output. But we'll concentrate on the coefficient table (the middle table). We can interpret the `RM` coefficient (9.1021) by first noticing that its p-value (under `P>|t|`) is essentially zero, so the number of rooms, `RM`, is a statistically significant predictor of `PRICE`. The regression coefficient for `RM` of 9.1021 means that *on average, each additional room is associated with an increase of $\$9,100$ in house price, net of the other variables*. The confidence interval gives us a range of plausible values for this average change, about ($\$8,279, \$9,925$), definitely not chump change.

In general, $\hat{\beta}_i, i > 0$ can be interpreted as follows: "A one unit increase in $x_i$ is associated with, on average, a $\hat{\beta}_i$ increase/decrease in $y$, net of all other variables." The intercept, $\hat{\beta}_0$, is interpreted as the average of $y$ given that all of the independent variables $x_i$ are 0.

#### `statsmodels` formulas

***

This formula notation will seem familiar to `R` users, but will take some getting used to for people coming from other languages or who are new to statistics.

The formula gives instructions for the general structure of a regression call. For `statsmodels` (`ols` or `logit`) calls you need a Pandas dataframe with the column names that you will use in your formula. In the example below you need a pandas dataframe that includes the columns named (`Outcome`, `X1`, `X2`, ...), but you don't need to build a new dataframe for every regression; use the same dataframe with all these columns in it. The structure is very simple:

`Outcome ~ X1`

But of course we want to be able to handle more complex models; for example, multiple regression is done like this:

`Outcome ~ X1 + X2 + X3`

In general, a formula for an OLS multiple linear regression is

`Y ~ X1 + X2 + ... + Xp`

This is the very basic structure, but it should be enough to get you through the homework. Things can get much more complex. You can force statsmodels to treat variables as categorical with the `C()` function, call numpy functions to transform data such as `np.log` for extremely skewed data, or fit a model without an intercept by including `- 1` in the formula. For a quick run-down of further uses see the `statsmodels` [help page](http://statsmodels.sourceforge.net/devel/example_formulas.html).

Let's see how our model actually fit our data. We can see below that there is a ceiling effect; we should probably look into that. Also, for large values of $Y$ we get underpredictions; most predictions are below the 45-degree gridlines.
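As an aside on the formula syntax just described, here is a small illustrative sketch (not part of the original lab) of those three variations on the Boston data; the variable choices are arbitrary and only meant to show the notation.

```python
# Illustrative formula variations (variable choices are arbitrary)
m_cat = ols('PRICE ~ RM + C(CHAS)', bos).fit()       # treat CHAS as categorical
m_log = ols('PRICE ~ RM + np.log(CRIM)', bos).fit()  # transform a heavily skewed predictor
m_noint = ols('PRICE ~ RM - 1', bos).fit()           # drop the intercept

print(m_cat.params)
print(m_log.params)
print(m_noint.params)
```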

Part 3 Checkup Exercise Set I

Exercise: Create a scatterplot between the predicted prices, available in `m.fittedvalues` (where `m` is the fitted model) and the original prices. How does the plot look? Do you notice anything interesting or weird in the plot? Comment on what you see.

```python
# your turn
plt.scatter(bos.RM, m.fittedvalues)
plt.xlabel("Number of Rooms")
plt.ylabel("Predicted Housing Price")
plt.title("Predicted Price from Number of Rooms")
```

Text(0.5, 1.0, 'Predicted Price from Number of Rooms')

![png](output_49_1.png)

The fitted values are (of course) perfectly linear in the predictor, since this is a linear prediction. However, the model predicts a price below zero for a small enough number of rooms. This is obviously impossible, and it means that using only the average number of rooms is too simple a model.

### Fitting Linear Regression using `sklearn`

```python
from sklearn.linear_model import LinearRegression

X = bos.drop('PRICE', axis=1)

# This creates a LinearRegression object
lm = LinearRegression()
lm
```

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

#### What can you do with a LinearRegression object?

***

Check out the scikit-learn [docs here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). We have listed the main functions here. Most machine learning models in scikit-learn follow this same API: fit a model with `fit`, make predictions with `predict`, and score the model with the appropriate `score` function.

Main functions | Description
--- | ---
`lm.fit()` | Fit a linear model
`lm.predict()` | Predict $Y$ using the linear model with estimated coefficients
`lm.score()` | Returns the coefficient of determination ($R^2$), *a measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model*

#### What output can you get?

```python
# Look inside lm object
# lm.
```

Output | Description
--- | ---
`lm.coef_` | Estimated coefficients
`lm.intercept_` | Estimated intercept

### Fit a linear model

***

The `lm.fit()` function estimates the coefficients of the linear regression using least squares.

```python
# Use all 13 predictors to fit linear regression model
lm.fit(X, bos.PRICE)
```

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
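Once the model is fit, the usual scikit-learn API applies. As a quick sketch (output omitted), `predict` and `score` can be called directly on the fitted object; `score` returns the same $R^2$ discussed later in this notebook.

```python
# Predictions for the first five rows and the R^2 on the training data
print(lm.predict(X.iloc[:5]))
print(lm.score(X, bos.PRICE))
```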

Part 3 Checkup Exercise Set II

Exercise: How would you change the model to not fit an intercept term? Would you recommend not having an intercept? Why or why not? For more information on why to include or exclude an intercept, look [here](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-is-regression-through-the-origin/).

Exercise: One of the assumptions of the linear model is that the residuals must be i.i.d. (independently and identically distributed). To satisfy this, is it enough that the residuals are normally distributed? Explain your answer.

Exercise: True or false. To use linear regression, $Y$ must be normally distributed. Explain your answer.

# your turn

I would change the model to exclude an intercept by centering the data (subtracting the mean from both the predictors and the response) and then creating the `LinearRegression` object with `fit_intercept=False`. In general I would not recommend dropping the intercept unless there is a strong theoretical reason to force the regression through the origin.

No, normality of the residuals is not enough on its own. The residuals could each be normally distributed and still be correlated with one another (as in time series data), or have a variance that changes with $X$; either would violate the "independently and identically distributed" assumption.

False. $Y$ itself does not need to be normally distributed; the normality requirement applies to the residuals, i.e. to $Y$ after the linear trend in $X$ has been subtracted off by the model.

### Estimated intercept and coefficients

Let's look at the estimated coefficients from the linear model using `lm.intercept_` and `lm.coef_`.

After we have fit our linear regression model using the least squares method, we want to see what the estimates of our coefficients $\beta_0$, $\beta_1$, ..., $\beta_{13}$ are:

$$ \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{13} $$

```python
print('Estimated intercept coefficient: {}'.format(lm.intercept_))
```

Estimated intercept coefficient: 36.45948838509009

```python
print('Number of coefficients: {}'.format(len(lm.coef_)))
```

Number of coefficients: 13

```python
# The coefficients
pd.DataFrame({'features': X.columns, 'estimatedCoefficients': lm.coef_})[['features', 'estimatedCoefficients']]
```
features estimatedCoefficients
0 CRIM -0.108011
1 ZN 0.046420
2 INDUS 0.020559
3 CHAS 2.686734
4 NOX -17.766611
5 RM 3.809865
6 AGE 0.000692
7 DIS -1.475567
8 RAD 0.306049
9 TAX -0.012335
10 PTRATIO -0.952747
11 B 0.009312
12 LSTAT -0.524758
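As a follow-up to Checkup Exercise Set II above, here is a minimal sketch of the centering approach mentioned there; `fit_intercept=False` is a real `LinearRegression` option, while the centering step is just one reasonable way to carry it out.

```python
# Center the predictors and the response, then fit through the origin
X_centered = X - X.mean()
y_centered = bos.PRICE - bos.PRICE.mean()

lm_no_int = LinearRegression(fit_intercept=False)
lm_no_int.fit(X_centered, y_centered)

print(lm_no_int.intercept_)   # 0.0 by construction
print(lm_no_int.coef_[:3])    # slopes should match the fit that included an intercept
```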
### Predict Prices

We can calculate the predicted prices ($\hat{Y}_i$) using `lm.predict`.

$$ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \ldots + \hat{\beta}_{13} X_{13} $$

```python
# first five predicted prices
lm.predict(X)[0:5]
```

array([30.00384338, 25.02556238, 30.56759672, 28.60703649, 27.94352423])

Part 3 Checkup Exercise Set III

Exercise: Histogram: Plot a histogram of all the predicted prices. Write a story about what you see. Describe the shape, center and spread of the distribution. Are there any outliers? What might be the reason for them? Should we do anything special with them?

Exercise: Scatterplot: Let's plot the true prices against the predicted prices to see how they disagree (we did this with `statsmodels` before).

Exercise: We have looked at fitting a linear model in both `statsmodels` and `scikit-learn`. What are the advantages and disadvantages of each based on your exploration? Based on the information provided by both packages, what advantage does `statsmodels` provide?

```python
# your turn
plt.hist(lm.predict(X))
plt.xlabel('Housing Price')
plt.ylabel('Frequency')
plt.title('Distribution of Predicted Housing Prices')
plt.show()
```

![png](output_69_0.png)

The predicted housing prices appear roughly normally distributed, peaked around \$25k (a bin containing about 125 predictions), with roughly a \$10k spread on either side of the peak. There really aren't any outliers in the sense that even the lowest and highest predictions are still within about \$20k of the center. Unless there were some compelling reason to believe that the low and high prices were mistakes, we should typically keep outliers, possibly separating them from the main graph for visualization purposes.

```python
plt.scatter(bos.PRICE, lm.predict(X))
plt.xlabel('True Housing Price')
plt.ylabel('Predicted Housing Price')
plt.title('Predicted vs True Housing Price')
plt.show()
```

![png](output_71_0.png)

statsmodels has a more detailed summary output, so it may be more useful for generating readable reports about a fitted model. scikit-learn is geared more towards fitting models with minimal knowledge of the underlying data; it is easier to automate and may be more appropriate when exploring high-dimensional data sets.

### Evaluating the Model: Sum-of-Squares

The partitioning of the sum-of-squares shows the variance in the predictions explained by the model and the variance that is attributed to error.

$$TSS = ESS + RSS$$

#### Residual Sum-of-Squares (aka $RSS$)

The residual sum-of-squares is one of the basic ways of quantifying how much error exists in the fitted model. We will revisit this in a bit.

$$ RSS = \sum_{i=1}^N r_i^2 = \sum_{i=1}^N \left(y_i - \left(\beta_0 + \beta_1 x_i\right)\right)^2 $$

```python
print(np.sum((bos.PRICE - lm.predict(X)) ** 2))
```

11078.784577954977

#### Explained Sum-of-Squares (aka $ESS$)

The explained sum-of-squares measures the variance explained by the regression model.

$$ESS = \sum_{i=1}^N \left( \hat{y}_i - \bar{y} \right)^2 = \sum_{i=1}^N \left( \left( \hat{\beta}_0 + \hat{\beta}_1 x_i \right) - \bar{y} \right)^2$$

```python
print(np.sum((lm.predict(X) - np.mean(bos.PRICE)) ** 2))
```

31637.510837064707

### Evaluating the Model: The Coefficient of Determination ($R^2$)

The coefficient of determination, $R^2$, tells us the percentage of the variance in the response variable $Y$ that can be explained by the linear regression model.

$$ R^2 = \frac{ESS}{TSS} $$

The $R^2$ value is one of the most common metrics that people use in describing the quality of a model, but it is important to note that *$R^2$ increases artificially as a side-effect of increasing the number of independent variables.* While $R^2$ is reported in almost all statistical packages, another metric called the *adjusted $R^2$* is also provided, as it takes into account the number of variables in the model, and can sometimes even be used for non-linear regression models!

$$R_{adj}^2 = 1 - \left( 1 - R^2 \right) \frac{N - 1}{N - K - 1} = R^2 - \left( 1 - R^2 \right) \frac{K}{N - K - 1} = 1 - \frac{\frac{RSS}{DF_R}}{\frac{TSS}{DF_T}}$$

where $N$ is the number of observations, $K$ is the number of variables, $DF_R = N - K - 1$ is the degrees of freedom associated with the residual error, and $DF_T = N - 1$ is the degrees of freedom of the total error.

### Evaluating the Model: Mean Squared Error and the $F$-Statistic

***

The mean squared errors are just the *averages* of the sum-of-squares errors over their respective degrees of freedom.

$$MSE = \frac{RSS}{N-K-1}$$

$$MSR = \frac{ESS}{K}$$

**Remember:** Notation may vary across resources, particularly the use of $R$ and $E$ in $RSS/ESS$ and $MSR/MSE$. In some resources, E = explained and R = residual. In other resources, E = error and R = regression (explained). **This is a very important distinction that requires looking at the formula to determine which naming scheme is being used.**

Given the MSR and MSE, we can now determine whether or not the entire model we just fit is even statistically significant. We use an $F$-test for this. The null hypothesis is that all of the $\beta$ coefficients are zero, that is, none of them have any effect on $Y$. The alternative is that *at least one* $\beta$ coefficient is nonzero, but it doesn't tell us which one in a multiple regression:

$$H_0: \beta_i = 0, \mbox{for all $i$} \\ H_A: \beta_i \neq 0, \mbox{for some $i$}$$

$$F = \frac{MSR}{MSE} = \left( \frac{R^2}{1 - R^2} \right) \left( \frac{N - K - 1}{K} \right)$$

Once we compute the $F$-statistic, we can use the $F$-distribution with $K$ and $N-K-1$ degrees of freedom to get a p-value.

**Warning!** The $F$-statistic mentioned in this section is NOT the same as the F1-measure or F1-value discussed in Unit 7.
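To connect these formulas to the scikit-learn fit above, here is a sketch (not part of the original lab) that recomputes $R^2$, adjusted $R^2$, and the $F$-statistic by hand for the 13-predictor model, using the sum-of-squares definitions from this section.

```python
# Recompute R^2, adjusted R^2 and F from RSS/ESS for the full sklearn model
N, K = X.shape                      # 506 observations, 13 predictors
y_hat = lm.predict(X)

RSS = np.sum((bos.PRICE - y_hat) ** 2)
ESS = np.sum((y_hat - bos.PRICE.mean()) ** 2)
TSS = RSS + ESS

R2 = ESS / TSS
R2_adj = 1 - (1 - R2) * (N - 1) / (N - K - 1)
F = (ESS / K) / (RSS / (N - K - 1))
print(R2, R2_adj, F)                # R2 should agree with lm.score(X, bos.PRICE)
```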

Part 3 Checkup Exercise Set IV

Let's look at the relationship between `PTRATIO` and housing price.

Exercise: Try fitting a linear regression model using only the 'PTRATIO' (pupil-teacher ratio by town) and interpret the intercept and the coefficients.

Exercise: Calculate (or extract) the $R^2$ value. What does it tell you?

Exercise: Compute the $F$-statistic. What does it tell you?

Exercise: Take a close look at the $F$-statistic and the $t$-statistic for the regression coefficient. What relationship do you notice? Note that this relationship only applies in *simple* linear regression models.

```python
# your turn
m = ols('PRICE ~ PTRATIO', bos).fit()
print(m.summary())
```

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  PRICE   R-squared:                       0.258
    Model:                            OLS   Adj. R-squared:                  0.256
    Method:                 Least Squares   F-statistic:                     175.1
    Date:                Sat, 25 May 2019   Prob (F-statistic):           1.61e-34
    Time:                        10:34:49   Log-Likelihood:                -1764.8
    No. Observations:                 506   AIC:                             3534.
    Df Residuals:                     504   BIC:                             3542.
    Df Model:                           1
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     62.3446      3.029     20.581      0.000      56.393      68.296
    PTRATIO       -2.1572      0.163    -13.233      0.000      -2.477      -1.837
    ==============================================================================
    Omnibus:                       92.924   Durbin-Watson:                   0.725
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              191.444
    Skew:                           1.001   Prob(JB):                     2.68e-42
    Kurtosis:                       5.252   Cond. No.                         160.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The $R^2$ value is only 0.258, meaning that the linear fit is not very good and there is a large amount of residual error. The F-statistic is 175.1 with a tiny p-value, which means the model is statistically significant and there is indeed a relationship; this does not imply that it is the best model, however. Comparing the test statistics, the t-statistic for PTRATIO is -13.233, and $(-13.233)^2 \approx 175.1$, which is exactly the F-statistic: in a simple linear regression the overall F-test and the t-test on the single slope test the same hypothesis, so $F = t^2$. The F-statistic is always positive because it is a ratio of two sums of squares, while the t-statistic carries the sign of the coefficient.
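A quick sketch of that check, using attributes that statsmodels exposes on the fitted results object:

```python
# In simple linear regression the overall F equals the squared t of the single slope
print(m.fvalue)                    # overall F-statistic (175.1 above)
print(m.tvalues['PTRATIO'] ** 2)   # square of the PTRATIO t-statistic, should match F
```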

Part 3 Checkup Exercise Set V

Fit a linear regression model using three independent variables

  1. 'CRIM' (per capita crime rate by town)
  2. 'RM' (average number of rooms per dwelling)
  3. 'PTRATIO' (pupil-teacher ratio by town)

Exercise: Compute or extract the $F$-statistic. What does it tell you about the model?

Exercise: Compute or extract the $R^2$ statistic. What does it tell you about the model?

Exercise: Which variables in the model are significant in predicting house price? Write a story that interprets the coefficients.

```python
# your turn
m = ols('PRICE ~ PTRATIO+CRIM+RM', bos).fit()
print(m.summary())
```

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  PRICE   R-squared:                       0.594
    Model:                            OLS   Adj. R-squared:                  0.592
    Method:                 Least Squares   F-statistic:                     245.2
    Date:                Sat, 25 May 2019   Prob (F-statistic):           6.15e-98
    Time:                        10:34:49   Log-Likelihood:                -1612.0
    No. Observations:                 506   AIC:                             3232.
    Df Residuals:                     502   BIC:                             3249.
    Df Model:                           3
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     -3.3707      4.034     -0.836      0.404     -11.296       4.555
    PTRATIO       -1.0695      0.133     -8.051      0.000      -1.331      -0.809
    CRIM          -0.2050      0.032     -6.399      0.000      -0.268      -0.142
    RM             7.3804      0.402     18.382      0.000       6.592       8.169
    ==============================================================================
    Omnibus:                      234.656   Durbin-Watson:                   0.830
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2020.689
    Skew:                           1.815   Prob(JB):                         0.00
    Kurtosis:                      12.092   Cond. No.                         311.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The F-statistic is very large and its p-value is very small, meaning that the model coefficients are jointly statistically significant and unlikely to have arisen by chance. Again, this is not necessarily an indication of the quality of the model. The $R^2$ value is better than when fitting PTRATIO alone, but it is still not very close to 1. I conclude that this model has improved on the last model, but further improvement is likely possible. All of the variables used are significant predictors of housing price except the intercept; given its large p-value (0.404), the fitted intercept is not distinguishable from zero.

## Part 4: Comparing Models

During modeling, there will be times when we want to compare models to see which one is more predictive or fits the data better. There are many ways to compare models, but we will focus on two.

### The $F$-Statistic Revisited

The $F$-statistic can also be used to compare two *nested* models, that is, two models trained on the same dataset where one of the models contains a *subset* of the variables of the other model. The *full* model contains $K$ variables and the *reduced* model contains a subset of these $K$ variables. This allows us to add additional variables to a base model and then test whether adding the variables helped the model fit.

$$F = \frac{\left( \frac{RSS_{reduced} - RSS_{full}}{DF_{reduced} - DF_{full}} \right)}{\left( \frac{RSS_{full}}{DF_{full}} \right)}$$

where $DF_x = N - K_x - 1$ and $K_x$ is the number of variables in model $x$.

### Akaike Information Criterion (AIC)

Another statistic for comparing two models is AIC, which is based on the likelihood function and takes into account the number of variables in the model.

$$AIC = 2 K - 2 \log_e{L}$$

where $L$ is the likelihood of the model. AIC is meaningless in the absolute sense; it is only meaningful when compared to AIC values from other models. Lower values of AIC indicate better-fitting models. `statsmodels` provides the AIC in its output.
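Before the exercise below, here is an illustrative sketch of how the nested-model $F$-test and AIC comparison can be done in statsmodels; the particular pair of models is an arbitrary choice for demonstration.

```python
# Partial F-test and AIC for two nested models (illustrative choice of variables)
reduced = ols('PRICE ~ PTRATIO + CRIM + RM', bos).fit()
full = ols('PRICE ~ PTRATIO + CRIM + RM + LSTAT', bos).fit()

f_value, p_value, df_diff = full.compare_f_test(reduced)
print(f_value, p_value, df_diff)
print(reduced.aic, full.aic)        # lower AIC indicates the better-fitting model
```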

Part 4 Checkup Exercises

Exercise: Find another variable (or two) to add to the model we built in Part 3. Compute the $F$-test comparing the two models as well as the AIC. Which model is better?

```python
m = ols('PRICE ~ PTRATIO+CRIM+RM+LSTAT+DIS', bos).fit()
print(m.summary())
```

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  PRICE   R-squared:                       0.696
    Model:                            OLS   Adj. R-squared:                  0.693
    Method:                 Least Squares   F-statistic:                     228.7
    Date:                Sat, 25 May 2019   Prob (F-statistic):          1.10e-126
    Time:                        10:34:49   Log-Likelihood:                -1539.2
    No. Observations:                 506   AIC:                             3090.
    Df Residuals:                     500   BIC:                             3116.
    Df Model:                           5
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     22.8945      4.080      5.611      0.000      14.879      30.910
    PTRATIO       -0.9214      0.116     -7.914      0.000      -1.150      -0.693
    CRIM          -0.0918      0.031     -2.998      0.003      -0.152      -0.032
    RM             4.3326      0.422     10.266      0.000       3.503       5.162
    LSTAT         -0.6244      0.048    -12.911      0.000      -0.719      -0.529
    DIS           -0.6201      0.128     -4.845      0.000      -0.872      -0.369
    ==============================================================================
    Omnibus:                      182.594   Durbin-Watson:                   0.960
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              810.523
    Skew:                           1.561   Prob(JB):                    9.93e-177
    Kurtosis:                       8.357   Cond. No.                         442.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The F-statistics of the two models are 245.2 for the less complex model and 228.7 for the more complex one, and the AIC went from 3232 to 3090 when the two variables were added. I conclude that both models are highly statistically significant, and that the more complex model has improved on the less complex model because the AIC decreased.

## Part 5: Evaluating the Model via Model Assumptions and Other Issues

***

Linear regression makes several assumptions. It is always best to check that these assumptions are valid after fitting a linear regression model.
  • **Linearity**. The dependent variable $Y$ is a linear combination of the regression coefficients and the independent variables $X$. This can be verified with a scatterplot of each $X$ vs. $Y$ and by plotting correlations among $X$. Nonlinearity can sometimes be resolved by [transforming](https://onlinecourses.science.psu.edu/stat501/node/318) one or more independent variables, the dependent variable, or both. In other cases, a [generalized linear model](https://en.wikipedia.org/wiki/Generalized_linear_model) or a [nonlinear model](https://en.wikipedia.org/wiki/Nonlinear_regression) may be warranted.
  • **Constant standard deviation**. The SD of the dependent variable $Y$ should be constant for different values of $X$. We can check this by plotting each $X$ against $Y$ and verifying that there is no "funnel" shape showing data points fanning out as $X$ increases or decreases. Some techniques for dealing with non-constant variance include weighted least squares (WLS), [robust standard errors](https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors), or variance-stabilizing transformations.
  • **Normal distribution for errors**. The $\epsilon$ term we discussed at the beginning is assumed to be normally distributed: $$ \epsilon_i \sim N(0, \sigma^2)$$ This can be verified with a fitted values vs. residuals plot (checking that there is no pattern) and with a quantile plot. Sometimes the distribution of the responses $Y$ may not be normally distributed at any given value of $X$, e.g. skewed positively or negatively.
  • **Independent errors**. The observations are assumed to be obtained independently.
    • e.g. observations across time may be correlated

There are some other issues that are important to investigate with linear regression models.

  • **Correlated Predictors:** Care should be taken to make sure that the independent variables in a regression model are not too highly correlated. Correlated predictors typically do not majorly affect prediction, but they do inflate the standard errors of coefficients, making interpretation unreliable. Common solutions are dropping the least important variables involved in the correlations, using regularization, or, when many predictors are highly correlated, considering a dimension reduction technique such as principal component analysis (PCA).
  • **Influential Points:** Data points that have undue influence on the regression model. These points can be high-leverage points or outliers. Such points are typically removed and the regression model rerun (see the sketch just below for one way to quantify influence).
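One way to quantify influence, sketched below, is Cook's distance, which statsmodels exposes through `get_influence()` on a fitted OLS results object; here `m` is assumed to be whichever model you fit most recently (e.g. the one from the previous exercise).

```python
# Cook's distance for each observation of a fitted OLS results object `m`
influence = m.get_influence()
cooks_d, _ = influence.cooks_distance
print(np.argsort(cooks_d)[-5:])     # indices of the five most influential points
```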

Part 5 Checkup Exercises

Take the reduced model from Part 3 to answer the following exercises. Take a look at [this blog post](http://mpastell.com/2013/04/19/python_regression/) for more information on using statsmodels to construct these plots.

Exercise: Construct a fitted values versus residuals plot. What does the plot tell you? Are there any violations of the model assumptions?

Exercise: Construct a quantile plot of the residuals. What does the plot tell you?

Exercise: What are some advantages and disadvantages of the fitted vs. residual and quantile plot compared to each other?

Exercise: Identify any outliers (if any) in your model and write a story describing what these outliers might represent.

Exercise: Construct a leverage plot and identify high leverage points in the model. Write a story explaining possible reasons for the high leverage points.

Exercise: Remove the outliers and high leverage points from your model and run the regression again. How do the results change?

```python
# Your turn.
m = ols('PRICE ~ PTRATIO+CRIM+RM', bos).fit()
plt.scatter(m.resid, m.fittedvalues, alpha=0.1)
plt.xlabel('Residuals')
plt.ylabel('Fitted Values')
plt.title('Fitted Values vs Residuals')
plt.show()
```

![png](output_94_0.png)

```python
m.summary()
```

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  PRICE   R-squared:                       0.594
    Model:                            OLS   Adj. R-squared:                  0.592
    Method:                 Least Squares   F-statistic:                     245.2
    Date:                Sat, 25 May 2019   Prob (F-statistic):           6.15e-98
    Time:                        10:34:49   Log-Likelihood:                -1612.0
    No. Observations:                 506   AIC:                             3232.
    Df Residuals:                     502   BIC:                             3249.
    Df Model:                           3
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     -3.3707      4.034     -0.836      0.404     -11.296       4.555
    PTRATIO       -1.0695      0.133     -8.051      0.000      -1.331      -0.809
    CRIM          -0.2050      0.032     -6.399      0.000      -0.268      -0.142
    RM             7.3804      0.402     18.382      0.000       6.592       8.169
    ==============================================================================
    Omnibus:                      234.656   Durbin-Watson:                   0.830
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2020.689
    Skew:                           1.815   Prob(JB):                         0.00
    Kurtosis:                      12.092   Cond. No.                         311.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Ideally this graph would show no relationship between the residuals and the fitted values, since the residuals should be unrelated to the model. The clear downward slope between the fitted values and the residuals says that there remain linear relationships that have not been explained by the model.

```python
qresid = [1 - np.sum(m.resid > x) / len(m.resid) for x in m.resid]
plt.scatter(m.resid, qresid, marker='.')
plt.xlabel('Residuals')
plt.ylabel('Percentile residuals')
plt.title('Percentile residuals vs Residuals')
plt.show()
```

![png](output_97_0.png)

This plot looks normally distributed for the most part; however, it also shows that there are about six points lying outside the main distribution. These points are probably also the ones lying along the line in the residuals vs. predicted values plot. The residuals plot is a good overview of whether there are patterns in the data that have yet to be explained, while the percentile plot clearly shows which specific points are outliers from normally distributed residuals.

```python
bos.loc[list(m.resid[(m.resid > 20)].index), ['CRIM', 'RM', 'PTRATIO']]
```
CRIM RM PTRATIO
365 4.55587 3.561 20.2
367 13.52220 3.863 20.2
368 4.89822 4.970 20.2
369 5.66998 6.683 20.2
370 6.53876 7.016 20.2
371 9.23230 6.216 20.2
372 8.26725 5.875 20.2
All the outliers have the same pupil-teacher ratio. Since there isn't much consistency in the other two variables, and 20.2 is at the upper quartile of possible values, I conclude that house prices only drop so far as PTRATIO grows.

```python
from statsmodels.graphics.regressionplots import *

plot_leverage_resid2(m)
plt.show()
```

![png](output_101_0.png)

```python
out_ind = list(m.resid[(m.resid > 20)].index)
out_ind += [380, 418, 405]
bos_out = bos.drop(out_ind)
```

```python
m = ols('PRICE ~ PTRATIO+CRIM+RM', bos_out).fit()
plt.scatter(m.resid, m.fittedvalues, alpha=0.1)
plt.xlabel('Residuals')
plt.ylabel('Fitted Values')
plt.title('Fitted Values vs Residuals')
plt.show()
```

![png](output_103_0.png)

```python
m.summary()
```

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                  PRICE   R-squared:                       0.717
    Model:                            OLS   Adj. R-squared:                  0.715
    Method:                 Least Squares   F-statistic:                     414.5
    Date:                Sat, 25 May 2019   Prob (F-statistic):          3.19e-134
    Time:                        10:34:51   Log-Likelihood:                -1468.9
    No. Observations:                 496   AIC:                             2946.
    Df Residuals:                     492   BIC:                             2963.
    Df Model:                           3
    Covariance Type:            nonrobust
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     -6.0479      3.294     -1.836      0.067     -12.519       0.423
    PTRATIO       -1.1223      0.107    -10.448      0.000      -1.333      -0.911
    CRIM          -0.2556      0.035     -7.267      0.000      -0.325      -0.187
    RM             7.9167      0.336     23.560      0.000       7.256       8.577
    ==============================================================================
    Omnibus:                       15.619   Durbin-Watson:                   1.098
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):               29.793
    Skew:                           0.147   Prob(JB):                     3.39e-07
    Kurtosis:                       4.164   Cond. No.                         312.
    ==============================================================================

    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The $R^2$ increases significantly and the AIC drops when the outliers and high-leverage points are removed. The model fits better, but an effort should be made to understand what factors made these points outliers in the first place.
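As a closing aside on the quantile plot exercise above, statsmodels also provides a ready-made Q-Q plot of the residuals against a theoretical normal distribution, which is a common alternative to building the percentile plot by hand; a minimal sketch:

```python
import statsmodels.api as sm

# Q-Q plot of the residuals of the most recently fitted model
sm.qqplot(m.resid, line='s')
plt.title('Q-Q plot of residuals')
plt.show()
```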