In this project, we will analyze a dataset containing the annual spending amounts (reported in monetary units) of various customers across diverse product categories, looking for internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.
The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with focus instead on the six product categories recorded for customers.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
# Display a description of the dataset
display(data.describe())
To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail.
# Select three indices of your choice you wish to sample from the dataset
indices = [47,20,86]
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
display(samples)
One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Make a copy of the DataFrame, using the 'drop' function to drop the given feature
Y = data['Detergents_Paper']
new_data = data.drop('Detergents_Paper', axis = 1)
# Split the data into training and testing sets using the given feature as the target
X_train, X_test, y_train, y_test = train_test_split(new_data, Y, test_size=0.25, random_state = 0)
# Create a decision tree regressor and fit it to the training set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
# Report the score of the prediction using the testing set
score = regressor.score(X_test, y_test)
print score
The R^2 score for predicting the Detergents_Paper feature from the other features comes out to 0.674, which suggests there is a relationship between Detergents_Paper and the remaining features. This makes Detergents_Paper less necessary for identifying customers' spending habits, since much of its information is already captured by the other features.
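The same check can be repeated for every feature (a small sketch extending the cell above and reusing its imports; the added random_state on the regressor is an assumption for reproducibility):
# For each feature: drop it, train a regressor on the remaining five features,
# and report how well the dropped feature can be predicted (R^2 on a held-out set)
for feature in data.keys():
    target = data[feature]
    remaining = data.drop(feature, axis = 1)
    X_tr, X_te, y_tr, y_te = train_test_split(remaining, target, test_size=0.25, random_state=0)
    r2 = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    print("{}: R^2 = {:.3f}".format(feature, r2))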
To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data.
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
Detergents_Paper, Milk and Grocery show some correlation with each other. This scatter matrix confirms the previous conclusion about the Detergents_Paper feature, which has a strong correlation with Grocery. From the density plots on the diagonal it can be seen that most of the data points lie near the origin with a long tail, so the features can be considered approximately log-normally distributed.
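These visual impressions can be checked numerically (a small addition, not part of the original analysis) with the pairwise Pearson correlation matrix:
# Pairwise correlations between the six product categories; Grocery,
# Detergents_Paper and Milk should show the strongest relationships
display(data.corr().round(2))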
In this section, we will preprocess the data to create a better representation of customers by scaling the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in ensuring that the results obtained from the analysis are significant and meaningful.
If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is the Box-Cox transformation, which estimates the power transformation of the data that best reduces skewness. A simpler approach, which works in most cases, is applying the natural logarithm.
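For reference, a Box-Cox transformation could be applied per feature using scipy (a sketch only; the boxcox_data frame introduced here is not used further, and the analysis below sticks with the simpler natural logarithm):
# Box-Cox transform each feature, letting scipy estimate the power parameter
# lambda that best reduces skewness (requires strictly positive values)
from scipy import stats
boxcox_data = pd.DataFrame(index = data.index)
for feature in data.keys():
    transformed, lmbda = stats.boxcox(data[feature].values)
    boxcox_data[feature] = transformed
    print("{}: estimated lambda = {:.3f}".format(feature, lmbda))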
# Scale the data using the natural logarithm
log_data = np.log(data)
# Scale the sample data using the natural logarithm
log_samples = np.log(samples)
# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).
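One quick way to check (a small addition) is to recompute the correlation matrix on the log-transformed data and compare it with the one computed earlier:
# Pairwise correlations after the log transform; the Grocery, Detergents_Paper
# and Milk relationships identified earlier should still be visible
display(log_data.corr().round(2))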
Run the code below to see how the sample data has changed after having the natural logarithm applied to it.
# Display the log-transformed sample data
display(log_samples)
Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take these data points into consideration. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature value beyond an outlier step outside of the IQR for that feature is considered abnormal.
# For each feature find the data points with extreme high or low values
outliers = []
outliers_multi = []
for feature in log_data.keys():
    # Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(log_data[feature], 25)
    # Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(log_data[feature], 75)
    # Use the interquartile range to calculate an outlier step (1.5 times the IQR)
    step = 1.5 * (Q3 - Q1)
    # Display the outliers for this feature
    print "Data points considered outliers for the feature '{}':".format(feature)
    feature_outliers = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))].index.values
    display(log_data.loc[feature_outliers])
    # Track indices that are outliers in more than one feature, and all outlier indices
    outliers_multi.extend(x for x in feature_outliers if (x in outliers) and (x not in outliers_multi))
    outliers.extend(x for x in feature_outliers if x not in outliers)

# OPTIONAL: Select the indices for data points you wish to remove
#outliers = []

# Remove the outliers, if any were specified
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)
print "Data points that appear as outliers in more than one feature =", outliers_multi
Data points that appear as outliers in more than one feature = [154, 65, 75, 66, 128]
All of the data points considered outliers have been removed from the dataset. These outliers may represent specialty restaurants or retailers, or customers with more than one supplier. The outliers were removed because some clustering algorithms, such as k-means, are sensitive to outliers and can produce poor clusters as a result.
In this section you will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.
from sklearn.decomposition import PCA
# Apply PCA by fitting the good data with the same number of dimensions as features
pca = PCA()
pca.fit(good_data)
# Transform log_samples using the PCA fit above
pca_samples = pca.transform(log_samples)
# Generate PCA results plot
pca_results = vs.pca_results(good_data, pca)
# Re-apply PCA by fitting the good data with only two dimensions
pca = PCA(n_components=2).fit(good_data)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(pca.transform(good_data), columns = ['Dimension 1', 'Dimension 2'])
The cumulative variance explained by the first two principal components is 0.7252 (72.52% of the total variance); for the first four principal components it increases to 0.9279.
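These figures can be read directly from a full PCA fit (a small verification sketch; full_pca is introduced here only for this check):
# Cumulative explained variance of all six principal components
full_pca = PCA().fit(good_data)
print(np.round(np.cumsum(full_pca.explained_variance_ratio_), 4))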
The first dimension has the highest weight for customer spending on Detergents_Paper, followed by Grocery and Milk with the second and third highest weights respectively. These types of purchases generally represent customers in the retail business, where grocery purchases dominate the other categories. We can assume that retail customers will have a higher value for the first PCA component than customers running restaurant-type establishments.
The second dimension shows a somewhat opposite pattern compared to the first, giving more weight to customer spending in the Fresh, Frozen and Delicatessen categories. Customers with restaurant establishments generally require more fresh and frozen product than categories such as Grocery and Detergents_Paper, hence we can assume that customers with higher values in the second PCA component represent restaurant-type establishments.
The third PCA dimension gives the highest weight to the Fresh category, while the second highest is the negative of the Delicatessen category. Customers with high values in the third PCA dimension spend a lot in the Fresh category and very little on Delicatessen. This might represent salad-bar-type restaurants or fresh produce packaging companies.
The fourth dimension gives the highest weight to the Frozen category, while the second highest is the negative of the Delicatessen category. This might represent specialty establishments requiring a lot of frozen products, such as ice-cream shops.
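The feature weights behind these interpretations can also be tabulated (a small sketch reusing the full_pca fit from the check above; vs.pca_results already plots the same loadings):
# Loadings of the six principal components: rows are dimensions,
# columns are the original product categories
display(pd.DataFrame(full_pca.components_,
                     columns = good_data.keys(),
                     index = ['Dimension {}'.format(i) for i in range(1, 7)]).round(4))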
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
A biplot is a scatterplot where each data point is represented by its scores along the principal components. The axes are the principal components (in this case Dimension 1 and Dimension 2). In addition, the biplot shows the projection of the original features along the components. A biplot can help us interpret the reduced dimensions of the data, and discover relationships between the principal components and original features.
# Create a biplot
vs.biplot(good_data, reduced_data, pca)
Once we have the original feature projections (in red), it is easier to interpret the relative position of each data point in the scatterplot. For instance, a point in the lower right corner of the figure will likely correspond to a customer that spends a lot on 'Milk', 'Grocery' and 'Detergents_Paper', but not so much on the other product categories.
For clustering, we will choose between the K-Means clustering algorithm and the Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data.
K-Means is a fast and easy-to-implement clustering algorithm. It assigns each data point to a single cluster (hard assignment) and assumes globular clusters. In the Gaussian Mixture Model (GMM) clustering method, each data point is given a vector of probabilities of belonging to each cluster. This allows data points to be shared between clusters (overlapping clusters), which can be very valuable in some real-world applications. GMM also does not assume globular clusters, allowing for elongated clusters. Looking at the previous biplot, the customer data shows no clearly separated globular clusters, and several overlapping clusters are possible; hence we will choose the Gaussian Mixture Model algorithm for clustering the customer data.
When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.
In the code block below, we implement the following:
- Fit a clustering algorithm of our choice to reduced_data and assign it to clusterer.
- Predict the cluster for each data point in reduced_data using clusterer.predict and assign them to preds.
- Find the cluster centers and assign them to centers.
- Predict the cluster for each transformed sample data point in pca_samples and assign them to sample_preds.
- Import sklearn.metrics.silhouette_score, calculate the silhouette score of reduced_data against preds, assign it to score, and print the result.
# Apply your clustering algorithm of choice to the reduced data
from sklearn import mixture
clusterer = mixture.GMM(n_components=2, covariance_type='full').fit(reduced_data)
# Predict the cluster for each data point
preds = clusterer.predict(reduced_data)
# Find the cluster centers
centers = clusterer.means_
# Predict the cluster for each transformed sample data point
sample_data = pd.DataFrame(pca.transform(log_samples), columns = ['Dimension 1', 'Dimension 2'])
sample_preds = clusterer.predict(sample_data)
# Calculate the mean silhouette coefficient for the number of clusters chosen
from sklearn.metrics import silhouette_score
score = silhouette_score(reduced_data, preds)
print score
for i in range(2,16):
    clusterer_i = mixture.GMM(n_components=i, covariance_type='full').fit(reduced_data)
    preds_i = clusterer_i.predict(reduced_data)
    score_i = silhouette_score(reduced_data, preds_i)
    print "n_components =", i, ": score =", score_i
The best silhouette score is obtained when the number of clusters is 2.
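To back up the earlier choice of GMM over K-Means (a quick sketch; K-Means was not evaluated in the original analysis, and the random_state is an assumption), the same silhouette comparison can be run with sklearn.cluster.KMeans:
# Two-cluster K-Means silhouette score, for comparison with the GMM score above
from sklearn.cluster import KMeans
km_preds = KMeans(n_clusters=2, random_state=0).fit_predict(reduced_data)
print("K-Means (2 clusters) silhouette score: {:.4f}".format(silhouette_score(reduced_data, km_preds)))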
# Display the results of the clustering from implementation
vs.cluster_results(reduced_data, preds, centers, pca_samples)
Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.
# Inverse transform the centers
log_centers = pca.inverse_transform(clusterer.means_)
# Exponentiate the centers
true_centers = np.exp(log_centers)
# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
import matplotlib.pyplot as plt

# Compare the segment centers with the overall data median, category by category
compare = true_centers.copy()
compare.loc[true_centers.shape[0]] = data.median()
compare.plot(kind='bar')
labels = true_centers.index.values.tolist()
labels.append("Data Median")
plt.xticks(range(compare.shape[0]), labels)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
The graph above shows the customer spending in each category for the Segment 0 and Segment 1 cluster centers, along with the median values for the entire dataset. Comparing the true centers of Segments 0 and 1 with the data median, the center of Segment 0 has above-median values for the Fresh and Frozen categories and below-median values for the others. This pattern of large fresh and frozen purchases can be attributed to the restaurant industry, which requires more fresh and frozen produce than categories such as detergents or grocery. On the other hand, Segment 1 is below the median in the Fresh and Frozen categories and above it in the others, suggesting a retail-type establishment where Grocery is the highest-volume category.
display(samples)
# Display the predictions
for i, pred in enumerate(sample_preds):
    print "Sample point", i, "predicted to be in Cluster", pred
The predictions for the three selected customers are shown above. The GMM clustering assigned all of them to Cluster 1, which represents a retailer-type establishment. Comparing the Grocery and Detergents_Paper features of the three samples with those of the cluster centers, all three samples have high values for Grocery and Detergents_Paper (matching the center of Cluster 1), suggesting they are retail-type establishments rather than restaurants or cafes.
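As a quick check (a small sketch, not part of the original notebook), each sample's spending can be compared directly against the recovered center of its predicted segment:
# Difference between each sample's spending and the center of its predicted segment;
# positive values mean the sample spends more than that segment's average customer
for i, pred in enumerate(sample_preds):
    print("Sample point {} minus center of Segment {}:".format(i, pred))
    display(samples.iloc[i] - true_centers.loc['Segment {}'.format(pred)])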
In this final section, you will investigate ways that you can make use of the clustered data. First, you will consider how the different groups of customers, the customer segments, may be affected differently by a specific delivery scheme. Next, you will consider how giving a label to each customer (which segment that customer belongs to) can provide for additional features about the customer data. Finally, you will compare the customer segments to a hidden variable present in the data, to see whether the clustering identified certain relationships.
Companies will often run A/B tests when making small changes to their products or services to determine whether making that change will affect its customers positively or negatively. The wholesale distributor is considering changing its delivery service from currently 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively. How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?
Hint: Can we assume the change affects all customers equally? How can we determine which group of customers it affects the most?
Answer:
We cannot assume that the change will affect all customers equally. To identify which customers would be affected most, we have to look at the amount they spend on each category. Customers who buy a lot of fresh produce or milk will expect frequent, fresh deliveries, favoring the 5-day schedule. Customers who mostly buy Detergents_Paper or Grocery might not be adversely affected if the delivery frequency dropped from 5 days to 3 days. Then again, some customers might not have sufficient storage facilities for the larger quantities a lower delivery frequency implies. Even though the delivery service runs 5 times a week, not every customer is necessarily served on every run, so data on the number of deliveries per week for each customer would be very useful for this task.
To perform an A/B test, we can select a sample of customers from each segment, change the delivery frequency for them, and then gather feedback to see which segment of customers reacts negatively (or positively) to the new delivery scheme.
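To make this concrete, here is a minimal sketch (hypothetical; the 10% sampling fraction, the random seed, and the ab_groups structure are all assumptions, and no delivery data exists in the dataset) of how the cluster labels could be used to stratify such a test:
# Within each segment, draw ~10% of customers and split them evenly between the
# proposed 3-day delivery scheme (treatment) and the current 5-day scheme (control)
np.random.seed(42)
ab_groups = {}
for segment in np.unique(preds):
    segment_idx = reduced_data.index[preds == segment].values
    sample_size = max(2, len(segment_idx) // 10)
    trial = np.random.choice(segment_idx, size=sample_size, replace=False)
    ab_groups[segment] = {'treatment': trial[:sample_size // 2],
                          'control': trial[sample_size // 2:]}
    print("Segment {}: {} treatment, {} control".format(segment, sample_size // 2, sample_size - sample_size // 2))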
At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier to the original dataset. The sample points are circled in the plot, which will identify their labeling.
# Display the clustering results based on 'Channel' data
vs.channel_results(reduced_data, outliers, pca_samples)
The highest silhouette score was obtained for 2 clusters, which accurately reflects the two underlying groups: Hotel/Restaurant/Cafe customers and Retailer customers. Customers with a Dimension 1 value less than 0 can be identified as Hotels/Restaurants/Cafes, while those with a value greater than 2 can be identified as Retailers. Customers with Dimension 1 values between 0 and 2 can be seen as a mixture of the two segments. These two clusters were properly identified by the GMM clustering, as shown in the previous cluster visualization, so the earlier definition of the two clusters is consistent with the underlying distribution given by the 'Channel' feature.