Code 401 Class 13 Reading Notes
How to Run Linear Regression in Python
Using Scikit-learn to perform Linear Regression
Scikit-learn is a Python module for machine learning, which contains function for regression, classification, clustering, model selection and dimensionality reduction.
Exploring Data Sets
First Step: import required python libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import matlotlib.pyplot as plt
import sklearn
Store the data set int a lpython notebook.
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
explores the keys of the dictionary.
output: ['data', 'feature_names', 'DESCR', 'target']
View each key by printing boston
+ .data
or .feature_names
and so on.
Convert boston.data into a pandas data frame(use feature_names
to see column names and not numbers):
bos = pd.DataFrame(boston.data)
bos.head()
Adding target prices to the bos data frame with two steps
- boston.target[:5]
- bos[
PRICE
] = boston.target
pandas.DataFrame.drop: Drop removes specified labels from rows or columns.
Importing Linear Regression
- from sklearn.linear_model import LinearRegression
- x = bos.drop('PRICE',axis = 1)
#This creates a LinearRegression object
lm = LinearRegression()
lm
- lm.fit() -> fits a linear model
- lm.predict() -> Predict Y using the linear model with estimated coefficients
- lm.score() -> Returns the coefficient of determination (R^2). A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model.
- lm.coef_ gives the coefficients
- lm.intercept_ gives the estimated intercepts.
- lm.predict -> calculates predicted prices
To plot predicted prices
plt.scatter(bos.PRICE, lm.predict(X))
plt.xlabel('Prices: $Y_i$')
plt.ylabel('Predicted prices: $\hat{Y}_i$')
plt.title('Prices vs Predicted Prices: $Y_i$ vs $\hat{Y}_i$')
Output:
Train-test split
X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(X, bos.PRICE, test_size=0.33, random_state=5) print X_train.shape print X_test.shape print Y_train.shape print Y_test.shape
Then build a linear regression model using train-test data sets.
lm = LinearRegression() lm.fit(X_train, Y_train) pred_train = lm.predict(X_train) pred_test = lm.predict(X_test)
The calculate the mean squared error for training and test data
- Input:
- print “Fit a model X_train, and calculate MSE with Y_train:”, np.mean((Y_train – lm.predict(X_train)) ** 2)
- print “Fit a model X_train, and calculate MSE with X_test, Y_test:”, np.mean((Y_test – lm.predict(X_test)) ** 2)
- Output:
- Fit a model X_train, and calculate MSE with Y_train: 19.5467584735 Fit a model X_train, and calculate MSE with X_test, Y_test: 28.5413672756
Things I want to know more about
Statistics, I should have saved some notes when I took it. Most of this is foreign language, requiring more research to what the article is talking about. Can the plot be interactive? User add values and immediately manipulates data?