Initialization and data loading

Load required packages:

In [ ]:
import numpy as np
import pandas as pd
from google.colab import drive
import matplotlib.pyplot as plt
from sklearn import metrics

%matplotlib inline
from matplotlib.pylab import rcParams
import seaborn as sns
In [ ]:
import warnings
import itertools
warnings.filterwarnings("ignore") # specify to ignore warning messages
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

1. Example with customers dataset

In this section, we study a store's customer dataset. Each customer has been assigned a spending score from 1 to 100 (the "Spending Score (1-100)" column) based on purchase frequency and other conditions.

In [ ]:
data_x = pd.read_csv('/content/customers.csv')

data_x.head(5)
Out[ ]:
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

Let's set "Spending Score (1-100)" as our target variable to be predicted.

In [ ]:
data_x.describe()
Out[ ]:
       CustomerID         Age  Annual Income (k$)  Spending Score (1-100)
count  200.000000  200.000000          200.000000              200.000000
mean   100.500000   38.850000           60.560000               50.200000
std     57.879185   13.969007           26.264721               25.823522
min      1.000000   18.000000           15.000000                1.000000
25%     50.750000   28.750000           41.500000               34.750000
50%    100.500000   36.000000           61.500000               50.000000
75%    150.250000   49.000000           78.000000               73.000000
max    200.000000   70.000000          137.000000               99.000000

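Build the feature and target arrays, then split the data into training and test sets. Note that the only feature used here is the row position (np.arange over the number of rows), not the customer attributes.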

In [ ]:
x = np.arange(data_x.shape[0]).reshape((-1,1))
y= data_x['Spending Score (1-100)'].values.reshape((-1,1))
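The cell above uses the row position as the only feature. As a point of comparison, here is a minimal sketch of how columns of the table could be used as features instead. It rebuilds only the five rows shown in the head(5) output, and the choice of "Age" and "Annual Income (k$)" as features, as well as the names sample, feature_cols, x_alt and y_alt, are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# First five rows, copied from the data_x.head(5) output above.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Genre": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Hypothetical feature choice: the two numeric customer attributes.
feature_cols = ["Age", "Annual Income (k$)"]
x_alt = sample[feature_cols].values                               # shape (n_samples, 2)
y_alt = sample["Spending Score (1-100)"].values.reshape((-1, 1))  # shape (n_samples, 1)

print(x_alt.shape, y_alt.shape)
```

Arrays built this way can be passed to train_test_split and LinearRegression in the same way as x and y.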
In [ ]:
from sklearn.model_selection import train_test_split

# split the data 70:30 into training:testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.70, random_state=123)

print("len(X): {} len(y): {} \nlen(X_train): {}, len(X_test): \
{} \nlen(y_train): {},  len(y_test): {}".format(len(x), len(y),\
len(X_train), len(X_test), len(y_train), \
len(y_test)))
len(X): 200 len(y): 200 
len(X_train): 140, len(X_test): 60 
len(y_train): 140,  len(y_test): 60
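The sizes printed above follow directly from train_size: 70% of 200 rows is 140 training rows, and the remaining 60 rows form the test set. A small sketch on placeholder arrays of the same length (x_demo and y_demo are stand-ins, not the real data) shows the same split sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the customers data.
x_demo = np.arange(200).reshape((-1, 1))
y_demo = np.arange(200).reshape((-1, 1))

# Same call as in the notebook: 70% of the rows go to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.70, random_state=123
)

print(len(X_tr), len(X_te))
```

Because random_state is fixed, repeating the call returns the same rows in each set.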
In [ ]:
#Train the model
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[ ]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [ ]:
#predictions on test dataset
pred = regressor.predict(X_test)
In [ ]:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('MAE:', metrics.mean_absolute_error(y_test, pred))
RMSE: 27.665954214629625
MAE: 22.866781059345378
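To make the two metrics concrete, the sketch below computes RMSE and MAE by hand with NumPy and compares them with the sklearn results. The true values are the five spending scores from the head(5) output, and the constant prediction of 50 is made up purely for illustration; it is not the output of the model above.

```python
import numpy as np
from sklearn import metrics

# Five true spending scores from the head(5) output.
y_true = np.array([39.0, 81.0, 6.0, 77.0, 40.0]).reshape((-1, 1))
# Made-up constant prediction, for illustration only.
y_hat = np.full_like(y_true, 50.0)

errors = y_true - y_hat

# RMSE: square root of the mean squared error.
rmse_manual = np.sqrt(np.mean(errors ** 2))
# MAE: mean of the absolute errors.
mae_manual = np.mean(np.abs(errors))

rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
mae_sklearn = metrics.mean_absolute_error(y_true, y_hat)

print(np.isclose(rmse_manual, rmse_sklearn), np.isclose(mae_manual, mae_sklearn))
```

RMSE squares each error before averaging, so it penalizes large errors more heavily than MAE and is never smaller than MAE.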
In [ ]:
plt.figure(figsize=(4, 3))
plt.scatter(y_test, pred)
plt.axis('tight')
plt.xlabel('True spending score')
plt.ylabel('Predicted spending score')
plt.tight_layout()
In [ ]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot((y_test - pred).ravel(), bins=50, kde=True);

2. Example with avocado dataset

In [ ]:
data = pd.read_csv('/content/avocado.csv')

data.head(5)
Out[ ]:
   Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225    4770  Total Bags  Small Bags  Large Bags  XLarge Bags          type  year  region
0           0  2015-12-27          1.33      64236.62  1036.74   54454.85   48.16     8696.87     8603.62       93.25          0.0  conventional  2015  Albany
1           1  2015-12-20          1.35      54876.98   674.28   44638.81   58.33     9505.56     9408.07       97.49          0.0  conventional  2015  Albany
2           2  2015-12-13          0.93     118220.22   794.70  109149.67  130.50     8145.35     8042.21      103.14          0.0  conventional  2015  Albany
3           3  2015-12-06          1.08      78992.15  1132.00   71976.41   72.58     5811.16     5677.40      133.76          0.0  conventional  2015  Albany
4           4  2015-11-29          1.28      51039.60   941.48   43838.39   75.78     6183.95     5986.26      197.69          0.0  conventional  2015  Albany

Some relevant columns in the dataset:

Date - The date of the observation

AveragePrice - The average price of a single avocado
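pd.read_csv loads the Date column as plain strings unless told otherwise. The sketch below, which uses only the five rows shown in the head(5) output, illustrates one way to convert the column to datetimes and sort the rows chronologically. The name sample is a placeholder, and this step is an optional illustration that the cells below do not rely on.

```python
import pandas as pd

# Date and AveragePrice for the five rows shown in the data.head(5) output.
sample = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13", "2015-12-06", "2015-11-29"],
    "AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28],
})

# Convert the string dates to pandas datetimes.
sample["Date"] = pd.to_datetime(sample["Date"])

# Sort from oldest to newest and renumber the rows.
sample = sample.sort_values("Date").reset_index(drop=True)

print(sample.dtypes)
print(sample)
```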

In [ ]:
X = np.arange(data.shape[0]).reshape((-1,1))
Y= data['AveragePrice'].values.reshape((-1,1))
In [ ]:
# split the data 90:10 into training:testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.90, random_state=123)

print("len(X): {} len(y): {} \nlen(X_train): {}, len(X_test): \
{} \nlen(y_train): {},  len(y_test): {}".format(len(X), len(Y),\
len(X_train), len(X_test), len(y_train), \
len(y_test)))
len(X): 18249 len(y): 18249 
len(X_train): 16424, len(X_test): 1825 
len(y_train): 16424,  len(y_test): 1825
In [ ]:
#Train the model
regressor2 = LinearRegression()  
regressor2.fit(X_train, y_train)
Out[ ]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [ ]:
#Predictions on test data
pred2 = regressor2.predict(X_test)
In [ ]:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred2)))
print('MAE:', metrics.mean_absolute_error(y_test, pred2))
RMSE: 0.32655315170392324
MAE: 0.2549143995822833
In [ ]:
plt.figure(figsize=(7, 4))
plt.scatter(y_test, pred2)
plt.axis('tight')
plt.xlabel('True AveragePrice')
plt.ylabel('Predicted AveragePrice')
plt.tight_layout()
In [ ]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot((y_test - pred2).ravel(), bins=50, kde=True);