Machine Learning Engineer Nanodegree

Project: Creating Customer Segments

In this project, we will analyze a dataset containing annual spending amounts (reported in monetary units) for various customers across diverse product categories, with the aim of uncovering its internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In [2]:
# Display a description of the dataset
display(data.describe())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Implementation: Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail.

In [3]:
# Select three indices of your choice you wish to sample from the dataset
indices = [47,20,86]

# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
display(samples)
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 44466 54259 55571 7782 24171 6465
1 17546 4519 4602 1066 2259 2124
2 22925 73498 32114 987 20070 903

Implementation: Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In [4]:
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
Y = data['Detergents_Paper']
new_data = data.drop('Detergents_Paper', axis = 1)

# Split the data into training and testing sets using the given feature as the target
X_train, X_test, y_train, y_test = train_test_split(new_data, Y, test_size=0.25, random_state = 0)

# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
# Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
print score
0.681188580995

The R^2 score for predicting the Detergents_Paper feature from the other features comes out to be about 0.68 (the exact value varies slightly between runs because no random_state was fixed for the regressor). This suggests a strong relationship between Detergents_Paper and the other features, so Detergents_Paper carries largely redundant information and is not strictly necessary for identifying customers' spending habits.
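The same experiment can be repeated for every feature to see which ones are most predictable from the rest. A minimal sketch (not part of the original notebook), reusing the imports from the cell above and fixing random_state so the scores are reproducible:

# Sketch: R^2 score for predicting each feature from the remaining five
for feature in data.keys():
    target = data[feature]
    features = data.drop(feature, axis=1)
    X_tr, X_te, y_tr, y_te = train_test_split(features, target, test_size=0.25, random_state=0)
    reg = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    print("{}: {:.3f}".format(feature, reg.score(X_te, y_te)))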

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data.

In [5]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Detergents_Paper, Milk and Grocery show some correlation with each other. The scatter matrix confirms the previous conclusion about the Detergents_Paper feature, which correlates strongly with Grocery. From the density plots on the diagonal, it can be seen that most of the data points lie near the origin with a long tail, so the features can be considered approximately log-normally distributed.
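A quick way to quantify the relationships seen in the scatter matrix (not part of the original notebook) is the pairwise correlation matrix; the Grocery and Detergents_Paper pair should show the strongest correlation:

# Sketch: pairwise Pearson correlations between the six product categories
display(data.corr())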

Data Preprocessing

In this section, we will preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in assuring that the results obtained from the analysis are significant and meaningful.

Implementation: Feature Scaling

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach which works in most cases is applying the natural logarithm.
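As an aside (not part of the original notebook, which uses the simpler natural logarithm below), a Box-Cox transformation of a single feature could be sketched with SciPy as follows, assuming scipy is installed:

# Sketch: Box-Cox transform of the 'Fresh' feature (requires strictly positive values)
from scipy.stats import boxcox
fresh_bc, fresh_lambda = boxcox(data['Fresh'])
print("Estimated Box-Cox lambda for 'Fresh': {:.3f}".format(fresh_lambda))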

In [6]:
# Scale the data using the natural logarithm
log_data = np.log(data)

# Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Observation

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).

Run the code below to see how the sample data has changed after having the natural logarithm applied to it.

In [7]:
# Display the log-transformed sample data
display(log_samples)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 10.702480 10.901524 10.925417 8.959569 10.092909 8.774158
1 9.772581 8.416046 8.434246 6.971669 7.722678 7.661056
2 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723

Implementation: Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.

In [8]:
# For each feature find the data points with extreme high or low values
outliers  = []
outliers_multi = []
common={}
for feature in log_data.keys():
    
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature],25)
    
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature],75)
    
    # Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    step = 1.5*(Q3-Q1)
    
    # Display the outliers
    print "Data points considered outliers for the feature '{}':".format(feature)
    feature_outliers=log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))].index.values    
    #display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])
    display(log_data.iloc[feature_outliers])
    outliers_multi.extend(x for x in feature_outliers if (x in outliers) & (x not in outliers_multi))
    outliers.extend(x for x in feature_outliers if x not in outliers)
    
# OPTIONAL: Select the indices for data points you wish to remove
#outliers  = []

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print "Data points that appear as outliers in more than one feature = " , outliers_multi
Data points considered outliers for the feature 'Fresh':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
Data points considered outliers for the feature 'Milk':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
Data points considered outliers for the feature 'Grocery':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
Data points considered outliers for the feature 'Frozen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
Data points considered outliers for the feature 'Detergents_Paper':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
Data points considered outliers for the feature 'Delicatessen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
Data points that appear as outliers in more than one feature =  [154, 65, 75, 66, 128]

All the data points flagged as outliers, including [65, 66, 75, 128, 154] which appear as outliers in more than one feature, have been removed from the dataset. These outliers may represent specialty restaurants or retailers, or customers with more than one supplier. They were removed because clustering algorithms such as k-means are sensitive to outliers, which can distort the resulting clusters.

Feature Transformation

In this section you will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Implementation: PCA

In [9]:
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA()
pca.fit(good_data)

# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)

# Re-apply PCA with only the first two components for dimensionality reduction
pca = PCA(n_components=2).fit(good_data)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(pca.transform(good_data), columns = ['Dimension 1', 'Dimension 2'])

The cumulative variance explained by the first two principal components is 0.7252 (72.52% of the total variance), and for the first four principal components it increases to 0.9279.
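These cumulative figures can be checked directly from a full PCA fit; a minimal sketch (not part of the original notebook), assuming good_data is still in scope:

# Sketch: cumulative explained variance of all six principal components
full_pca = PCA().fit(good_data)
print(np.cumsum(full_pca.explained_variance_ratio_))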

The first dimension places the highest weight on customer spending for Detergents_Paper, followed by Grocery and Milk with the second and third highest weights respectively. These types of purchases generally represent customers in the retail business, where grocery purchases dominate the other categories. We can assume that retail customers will have a higher value for the first PCA component than customers with restaurant-type establishments.

The second dimension shows a somewhat opposite pattern to the first, giving more weight to customer spending on the Fresh, Frozen and Delicatessen categories. Restaurant establishments generally require more fresh and frozen products than categories such as Grocery and Detergents_Paper, so we can assume that customers with higher values in the second PCA component represent restaurant-type establishments.

The third PCA dimension gives the highest weight to the Fresh category, with the second highest being the negative of the Delicatessen category. Customers with high values in the third PCA dimension spend a lot on Fresh and very little on Delicatessen. This might represent salad-bar-type restaurants or a fresh produce packaging company.

The fourth dimension gives the highest weight to the Frozen category and the second highest to the negative of the Delicatessen category. This might represent specialty establishments requiring a lot of frozen products, such as ice-cream shops.

In [10]:
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
Dimension 1 Dimension 2 Dimension 3 Dimension 4 Dimension 5 Dimension 6
0 4.7265 3.4113 0.2381 0.0287 0.5103 0.0973
1 0.9509 0.6209 0.4557 -0.7648 -0.6039 0.4094
2 4.4809 0.8020 1.2612 -0.2571 1.3043 0.7119

Visualizing a Biplot

A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.

In [13]:
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x9b78330>

Observation

Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.

Clustering

For clustering we will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data.

K-means is a fast and easy-to-implement clustering algorithm. It assigns each data point to a single cluster (hard assignment) and assumes globular clusters. In the Gaussian Mixture Model (GMM) clustering method, each data point is given a vector of probabilities of belonging to each cluster. This makes it possible for some data points to belong to more than one cluster (overlapping clusters), which can be very valuable in some real-world applications. GMM also does not assume globular clusters, allowing for elongated clusters. Looking at the previous biplot, the customer data shows no clearly separated globular clusters, and several overlapping clusters are possible; hence we will choose the Gaussian Mixture Model algorithm for clustering the customer dataset.
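As a quick sanity check on this choice (not part of the original notebook, and assuming the older scikit-learn API where mixture.GMM is available), the two algorithms can be compared on the reduced data with two clusters:

# Sketch: silhouette scores for K-Means and GMM with two clusters
from sklearn.cluster import KMeans
from sklearn import mixture
from sklearn.metrics import silhouette_score

km_preds = KMeans(n_clusters=2, random_state=0).fit_predict(reduced_data)
gmm_preds = mixture.GMM(n_components=2, covariance_type='full', random_state=0).fit(reduced_data).predict(reduced_data)
print("K-Means silhouette: {:.3f}".format(silhouette_score(reduced_data, km_preds)))
print("GMM silhouette:     {:.3f}".format(silhouette_score(reduced_data, gmm_preds)))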

Implementation: Creating Clusters

When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.

In the code block below, we implement the following:

  • Fit a clustering algorithm to the reduced_data and assign it to clusterer.
  • Predict the cluster for each data point in reduced_data using clusterer.predict and assign them to preds.
  • Find the cluster centers using the algorithm's respective attribute and assign them to centers.
  • Predict the cluster for each sample data point in pca_samples and assign them sample_preds.
  • Import sklearn.metrics.silhouette_score and calculate the silhouette score of reduced_data against preds.
    • Assign the silhouette score to score and print the result.
In [14]:
# Apply your clustering algorithm of choice to the reduced data 

from sklearn import mixture
clusterer = mixture.GMM(n_components=2, covariance_type='full').fit(reduced_data)

# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)

# Find the cluster centers
centers = clusterer.means_

# Predict the cluster for each transformed sample data point
sample_data = pd.DataFrame(pca_samples[:, :2], columns = ['Dimension 1', 'Dimension 2'])  # keep only the first two components to match reduced_data
sample_preds = clusterer.predict(sample_data)

# Calculate the mean silhouette coefficient for the number of clusters chosen
from sklearn.metrics import silhouette_score
score = silhouette_score(reduced_data, preds)
print score
0.443759414328
In [15]:
for i in range(2,16):
    clusterer_i = mixture.GMM(n_components=i, covariance_type='full').fit(reduced_data)
    preds_i = clusterer_i.predict(reduced_data)
    score_i = silhouette_score(reduced_data, preds_i)
    print "n_components=",i,": score=", score_i
n_components= 2 : score= 0.443759414328
n_components= 3 : score= 0.379374624077
n_components= 4 : score= 0.275286254365
n_components= 5 : score= 0.267906784454
n_components= 6 : score= 0.251793766518
n_components= 7 : score= 0.255053650222
n_components= 8 : score= 0.303151905415
n_components= 9 : score= 0.261059462425
n_components= 10 : score= 0.156643441003
n_components= 11 : score= 0.116171891052
n_components= 12 : score= 0.0609385262458
n_components= 13 : score= 0.115974236454
n_components= 14 : score= 0.0665847076569
n_components= 15 : score= 0.144029491218

The best silhouette score is obtained when the number of clusters is 2.

Cluster Visualization

In [16]:
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)

Implementation: Data Recovery

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

In [17]:
# Inverse transform the centers
log_centers = pca.inverse_transform(clusterer.means_)

# Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Segment 0 9658.0 1955.0 2467.0 2250.0 308.0 785.0
Segment 1 5525.0 6843.0 10029.0 1154.0 3525.0 1073.0
In [22]:
import matplotlib.pyplot as plt

# Append the overall data median as an extra row for comparison with the segment centers
compare = true_centers.copy()
compare.loc[true_centers.shape[0]] = data.median()

# Plot the segment centers and the data median side by side for each product category
compare.plot(kind='bar')
labels = true_centers.index.values.tolist()
labels.append("Data Median")
plt.xticks(range(compare.shape[0]),labels)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()

The graph above shows the spending on each category for the Segment 0 and Segment 1 cluster centers, along with the median values for the entire dataset. Comparing the true centers with the data median, the center of Segment 0 has above-median values for the Fresh and Frozen categories and below-median values for the others. Large Fresh and Frozen purchases can be attributed to the restaurant industry, where more fresh and frozen produce is required relative to categories such as Detergents_Paper or Grocery. Segment 1, on the other hand, is below the median in Fresh and Frozen and above the median in the other categories, suggesting a retail-type establishment where Grocery is the high-volume category.

In [18]:
display(samples)
# Display the predictions
for i, pred in enumerate(sample_preds):
    print "Sample point", i, "predicted to be in Cluster", pred
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 44466 54259 55571 7782 24171 6465
1 17546 4519 4602 1066 2259 2124
2 22925 73498 32114 987 20070 903
Sample point 0 predicted to be in Cluster 1
Sample point 1 predicted to be in Cluster 1
Sample point 2 predicted to be in Cluster 1

My initial guesses for the 3 selected customers were:

  • customer 0 - retailer
  • customer 1 - small restaurant
  • customer 2 - dairy bar or ice-cream manufacturer

The GMM clustering, however, assigned all three of them to cluster 1, which represents a retail-type establishment. Comparing the Grocery and Detergents_Paper spending of the three samples with the cluster centers, all three samples have relatively high values for these features (closer to the cluster 1 center than to the cluster 0 center), suggesting they are retail-type establishments rather than restaurants or cafes.

Conclusion

In this final section, you will investigate ways that you can make use of the clustered data. First, you will consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme. Next, you will consider how giving a label to each customer (which segment that customer belongs to) can provide for additional features about the customer data. Finally, you will compare the customer segments to a hidden variable present in the data, to see whether the clustering identified certain relationships.

Question

Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively. How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?
Hint: Can we assume the change affects all customers equally? How can we determine which group of customers it affects the most?

Answer:

We cannot assume that the change will affect all customers equally. To identify which customers would be affected most, we have to look at how much they spend on each category. Customers buying a lot of fresh produce or milk will expect fresh deliveries, which requires the 5-day service, whereas customers mostly buying Detergents_Paper or Grocery might not be adversely affected if the delivery changed from 5 days to 3 days. Then again, some customers might not have sufficient storage facilities for the larger quantities implied by a lower delivery frequency. Even though the delivery service runs 5 times a week, not every customer is necessarily served on every run, so data on the number of deliveries per week for each customer would be very useful for this task.

To perform an A/B test, we can select a sample of customers from each segment and change the delivery frequency for them, then gather feedback to see which segment of customers reacts badly to the new delivery scheme, as sketched below.
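A minimal sketch of how such a per-segment test group could be drawn (hypothetical helper code, not part of the original notebook; it assumes good_data and preds from the earlier cells are in scope, and the 10% sample size is an arbitrary choice):

# Sketch: draw a 10% test group from each customer segment for the A/B test
segments_df = good_data.copy()
segments_df['Segment'] = preds
test_group = segments_df.groupby('Segment', group_keys=False).apply(
    lambda s: s.sample(frac=0.1, random_state=0))
print(test_group['Segment'].value_counts())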

Visualizing Underlying Distributions

At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset. The sample points are circled in the plot, which will identify their labeling.

In [19]:
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)

The highest silhouette score was obtained for 2 clusters, which accurately reflects the 2 underlying groups: Hotel/Restaurant/Cafe customers and Retailer customers. Customers with a value less than 0 in PCA Dimension 1 can be identified as Hotels/Restaurants/Cafes, while those with a value greater than 2 can be identified as Retailers. Customers with Dimension 1 values between 0 and 2 appear to be a mixture of the two segments. These two groups were properly identified by the GMM clustering, as shown in the earlier Cluster Visualization graph, so the earlier definition of the two clusters is consistent with the underlying distribution given by the 'Channel' feature.
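This agreement can also be checked numerically by cross-tabulating the GMM cluster assignments against the hidden 'Channel' label; a minimal sketch (not part of the original notebook), assuming customers.csv, the outliers list and preds are still available:

# Sketch: cluster assignments vs. the hidden 'Channel' feature for the non-outlier customers
full_data = pd.read_csv("customers.csv")
channel = full_data.drop(full_data.index[outliers]).reset_index(drop=True)['Channel']
print(pd.crosstab(channel, pd.Series(preds, name='Cluster')))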