Initialization and data loading

Load required packages:

import numpy as np
import pandas as pd
from google.colab import drive
import matplotlib.pyplot as plt
from sklearn import metrics

%matplotlib inline
from matplotlib.pylab import rcParams
import seaborn as sns

import warnings
import itertools
warnings.filterwarnings("ignore") # specify to ignore warning messages

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

1. Example with customers dataset

In this section, we study a store's customer data set. Each customer has been assigned a score from 1 to 100 (Spending Score (1-100)) based on purchase frequency and other conditions.

data_x = pd.read_csv('/content/customers.csv')

data_x.head(5)

Let's set "Spending Score (1-100)" as our target variable to be predicted.
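As a small illustration of picking out the target column, the sketch below rebuilds the first five customer rows (copied from the data_x.head(5) output) and extracts the target as a column vector. The `feature_cols` selection is a hypothetical alternative for illustration only; the notebook itself uses the row index as its single feature.

```python
import pandas as pd

# First five rows of the customers table, copied from the head() output.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Genre": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

target_col = "Spending Score (1-100)"

# Target as a column vector of shape (n_samples, 1), as in the notebook.
y_sample = sample[target_col].values.reshape(-1, 1)

# Hypothetical feature matrix built from customer attributes
# (an assumption for illustration, not what the notebook does).
feature_cols = ["Age", "Annual Income (k$)"]
X_sample = sample[feature_cols].values

print(y_sample.shape, X_sample.shape)
```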

data_x.describe()

Split our data into training and test sets.

# Feature: the row index (0..n-1) as a single column; target: the spending score
x = np.arange(data_x.shape[0]).reshape((-1, 1))
y = data_x['Spending Score (1-100)'].values.reshape((-1, 1))

# split the data 70:30 into training:testing sets
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.70, random_state=123)

print(
    "len(X): {} len(y): {} \n"
    "len(X_train): {}, len(X_test): {} \n"
    "len(y_train): {},  len(y_test): {}".format(
        len(x), len(y), len(X_train), len(X_test), len(y_train), len(y_test)
    )
)

len(X): 200 len(y): 200 
len(X_train): 140, len(X_test): 60 
len(y_train): 140,  len(y_test): 60
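As a quick sanity check on these sizes, the sketch below repeats the 70:30 split on placeholder arrays with the same number of rows (200). The names `x_demo` and `y_demo` are illustrative stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 200  # same row count as the customers data set
x_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.zeros((n_rows, 1))  # placeholder targets; only the sizes matter here

X_tr, X_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.70, random_state=123
)

# With train_size=0.70 the remaining 30% of the rows form the test set.
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same partition.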

# Train the model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# Predictions on the test data set
pred = regressor.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('MAE:', metrics.mean_absolute_error(y_test, pred))

RMSE: 27.665954214629625
MAE: 22.866781059345378
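To make the two metrics concrete, the following sketch computes RMSE and MAE by hand with NumPy and compares them with the scikit-learn functions used above. The arrays `y_true` and `y_hat` are made-up toy values, not results from the customers data.

```python
import numpy as np
from sklearn import metrics

# Toy values chosen for easy arithmetic (an assumption, not notebook data).
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 4.0])

errors = y_true - y_hat  # [1, 0, -2, 3]

# RMSE: square root of the mean squared error, sqrt((1 + 0 + 4 + 9) / 4)
rmse_manual = np.sqrt(np.mean(errors ** 2))

# MAE: mean of the absolute errors, (1 + 0 + 2 + 3) / 4
mae_manual = np.mean(np.abs(errors))

rmse_sklearn = np.sqrt(metrics.mean_squared_error(y_true, y_hat))
mae_sklearn = metrics.mean_absolute_error(y_true, y_hat)

print(rmse_manual, rmse_sklearn)
print(mae_manual, mae_sklearn)
```

Because RMSE squares the errors before averaging, it weights large errors more heavily than MAE does, which is why it is never smaller than MAE.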

plt.figure(figsize=(4, 3))
plt.scatter(y_test, pred)
plt.axis('tight')
plt.xlabel('True spending score')
plt.ylabel('Predicted spending score')
plt.tight_layout()

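# Distribution of the residuals (sns.distplot is deprecated in recent seaborn releases)
sns.histplot((y_test - pred).ravel(), bins=50, kde=True);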

2. Example with avocado dataset

data = pd.read_csv('/content/avocado.csv')

data.head(5)

Some relevant columns in the dataset:

Date - the date of the observation

AveragePrice - the average price of a single avocado
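For illustration, the sketch below shows one way these two columns could be prepared: parsing Date into datetimes, sorting chronologically, and deriving a numeric "days since first observation" column. The five rows are copied from the data.head(5) output; the `days` column is a hypothetical helper and is not used by the notebook, which takes the row index as its feature.

```python
import pandas as pd

# Date and AveragePrice for the first five rows, copied from the head() output.
sample = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13", "2015-12-06", "2015-11-29"],
    "AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28],
})

# Parse the date strings and put the rows in chronological order.
sample["Date"] = pd.to_datetime(sample["Date"])
sample = sample.sort_values("Date").reset_index(drop=True)

# Hypothetical numeric feature: days elapsed since the earliest observation.
sample["days"] = (sample["Date"] - sample["Date"].min()).dt.days

print(sample)
```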

# Feature: the row index (0..n-1) as a single column; target: the average price
X = np.arange(data.shape[0]).reshape((-1, 1))
Y = data['AveragePrice'].values.reshape((-1, 1))

# split the data 90:10 into training:testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.90, random_state=123)

print(
    "len(X): {} len(y): {} \n"
    "len(X_train): {}, len(X_test): {} \n"
    "len(y_train): {},  len(y_test): {}".format(
        len(X), len(Y), len(X_train), len(X_test), len(y_train), len(y_test)
    )
)

len(X): 18249 len(y): 18249 
len(X_train): 16424, len(X_test): 1825 
len(y_train): 16424,  len(y_test): 1825

# Train the model
regressor2 = LinearRegression()
regressor2.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# Predictions on the test data
pred2 = regressor2.predict(X_test)

print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, pred2)))
print('MAE:', metrics.mean_absolute_error(y_test, pred2))

RMSE: 0.32655315170392324
MAE: 0.2549143995822833
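An error value is easier to interpret next to a baseline. The sketch below, run on synthetic data rather than the avocado file, compares a linear regression on a row-index feature with a `DummyRegressor` that always predicts the training mean. The arrays `X_demo` and `y_demo` and their distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: a row-index feature and targets unrelated to it.
X_demo = np.arange(500).reshape(-1, 1)
y_demo = rng.normal(loc=1.4, scale=0.4, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.90, random_state=123
)

# Baseline that ignores the feature and predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_linear = np.sqrt(mean_squared_error(y_te, linear.predict(X_te)))

print("baseline RMSE:", rmse_baseline)
print("linear RMSE:", rmse_linear)
```

If the two values are close, the feature carries little information about the target, and the model is doing roughly what a constant prediction would.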

plt.figure(figsize=(7, 4))
plt.scatter(y_test, pred2)
plt.axis('tight')
plt.xlabel('True AveragePrice')
plt.ylabel('Predicted AveragePrice')
plt.tight_layout()

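# Distribution of the residuals (sns.distplot is deprecated in recent seaborn releases)
sns.histplot((y_test - pred2).ravel(), bins=50, kde=True);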

Output of data_x.describe():

       CustomerID         Age  Annual Income (k$)  Spending Score (1-100)
count  200.000000  200.000000          200.000000              200.000000
mean   100.500000   38.850000           60.560000               50.200000
std     57.879185   13.969007           26.264721               25.823522
min      1.000000   18.000000           15.000000                1.000000
25%     50.750000   28.750000           41.500000               34.750000
50%    100.500000   36.000000           61.500000               50.000000
75%    150.250000   49.000000           78.000000               73.000000
max    200.000000   70.000000          137.000000               99.000000

Output of data.head(5):

   Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225    4770  Total Bags  Small Bags  Large Bags          type  year  region
0           0  2015-12-27          1.33      64236.62  1036.74   54454.85   48.16     8696.87     8603.62       93.25  conventional  2015  Albany
1           1  2015-12-20          1.35      54876.98   674.28   44638.81   58.33     9505.56     9408.07       97.49  conventional  2015  Albany
2           2  2015-12-13          0.93     118220.22   794.70  109149.67  130.50     8145.35     8042.21      103.14  conventional  2015  Albany
3           3  2015-12-06          1.08      78992.15  1132.00   71976.41   72.58     5811.16     5677.40      133.76  conventional  2015  Albany
4           4  2015-11-29          1.28      51039.60   941.48   43838.39   75.78     6183.95     5986.26      197.69  conventional  2015  Albany

Output of data_x.head(5):

   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40