Code 401 Class 13 Reading Notes

How to Run Linear Regression in Python

Using Scikit-learn to perform Linear Regression

Scikit-learn is a Python module for machine learning, which contains function for regression, classification, clustering, model selection and dimensionality reduction.

Exploring Data Sets

First Step: import required python libraries

import numpy as np
import pandas as pd
import scipy.stats as stats
import matlotlib.pyplot as plt
import sklearn

Store the data set int a lpython notebook.

from sklearn.datasets import load_boston
boston = load_boston()

boston.keys() explores the keys of the dictionary. output: ['data', 'feature_names', 'DESCR', 'target']

View each key by printing boston + .data or .feature_names and so on.

Convert boston.data into a pandas data frame(use feature_names to see column names and not numbers):

bos = pd.DataFrame(boston.data)
bos.head()

Adding target prices to the bos data frame with two steps

boston.target[:5]
bos[PRICE] = boston.target

pandas.DataFrame.drop: Drop removes specified labels from rows or columns.

Importing Linear Regression

- from sklearn.linear_model import LinearRegression
- x = bos.drop('PRICE',axis = 1)

#This creates a LinearRegression object
lm = LinearRegression()
lm

lm.fit() -> fits a linear model
lm.predict() -> Predict Y using the linear model with estimated coefficients
lm.score() -> Returns the coefficient of determination (R^2). A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model.
lm.coef_ gives the coefficients
lm.intercept_ gives the estimated intercepts.
lm.predict -> calculates predicted prices

To plot predicted prices

plt.scatter(bos.PRICE, lm.predict(X))
plt.xlabel('Prices: $Y_i$')
plt.ylabel('Predicted prices: $\hat{Y}_i$')
plt.title('Prices vs Predicted Prices: $Y_i$ vs $\hat{Y}_i$')

Output:

Predicted prices

Train-test split

X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(X, bos.PRICE, test_size=0.33, random_state=5) print X_train.shape print X_test.shape print Y_train.shape print Y_test.shape

Then build a linear regression model using train-test data sets.

lm = LinearRegression() lm.fit(X_train, Y_train) pred_train = lm.predict(X_train) pred_test = lm.predict(X_test)

The calculate the mean squared error for training and test data

Input:
- print “Fit a model X_train, and calculate MSE with Y_train:”, np.mean((Y_train – lm.predict(X_train)) ** 2)
- print “Fit a model X_train, and calculate MSE with X_test, Y_test:”, np.mean((Y_test – lm.predict(X_test)) ** 2)
Output:
- Fit a model X_train, and calculate MSE with Y_train: 19.5467584735 Fit a model X_train, and calculate MSE with X_test, Y_test: 28.5413672756

Things I want to know more about

Statistics, I should have saved some notes when I took it. Most of this is foreign language, requiring more research to what the article is talking about. Can the plot be interactive? User add values and immediately manipulates data?

<—BACK