Data Scientist
This mini-project is based on this blog post by yhat. Please feel free to refer to the post for additional information and solutions.
%matplotlib inline
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")
The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()
| | offer_id | campaign | varietal | min_qty | discount | origin | past_peak |
|---|---|---|---|---|---|---|---|
0 | 1 | January | Malbec | 72 | 56 | France | False |
1 | 2 | January | Pinot Noir | 72 | 17 | France | False |
2 | 3 | February | Espumante | 144 | 32 | Oregon | True |
3 | 4 | February | Champagne | 72 | 48 | France | True |
4 | 5 | February | Cabernet Sauvignon | 144 | 44 | New Zealand | True |
We see that the first dataset contains information about each offer such as the month it is in effect and several attributes about the wine that the offer refers to: the variety, minimum quantity, discount, country of origin and whether or not it is past peak. The second dataset in the second worksheet contains transactional data – which offer each customer responded to.
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()
| | customer_name | offer_id | n |
|---|---|---|---|
0 | Smith | 2 | 1 |
1 | Smith | 24 | 1 |
2 | Johnson | 17 | 1 |
3 | Johnson | 24 | 1 |
4 | Johnson | 26 | 1 |
We’re trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group similar minded customers together. We can then study those groups to look for patterns and trends which can help us formulate future offers.
The first thing we need is a way to compare customers. To do this, we’re going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.
Exercise: Create a data frame where each row has the following columns (Use the pandas [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`pivot_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) functions for this purpose):

* customer_name
* One column for each offer, with a 1 if the customer responded to that offer

Make sure you also deal with any weird values such as `NaN`. Read the documentation to develop your solution.
```python
# your turn
df_transactions_p = df_transactions.pivot_table(index='customer_name',
                                                columns='offer_id',
                                                values='n',
                                                aggfunc=np.sum,
                                                fill_value=0,
                                                dropna=True)
df_transactions_p.head()
```

offer_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
customer_name | |||||||||||||||||||||
Adams | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
Allen | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Anderson | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Bailey | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Baker | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 32 columns
Exercise:
The average silhouette coefficient of a clustering can be interpreted with the following rule of thumb:

| Range | Interpretation |
|-------------|-----------------------------------------------|
| 0.71 - 1.0 | A strong structure has been found. |
| 0.51 - 0.7 | A reasonable structure has been found. |
| 0.26 - 0.5 | The structure is weak and could be artificial. |
| < 0.25 | No substantial structure has been found. |

Source: http://www.stat.berkeley.edu/~spector/s133/Clus.html

Fortunately, scikit-learn provides a function to compute this for us (phew!) called [`sklearn.metrics.silhouette_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html). Take a look at [this article](http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) on picking $K$ in scikit-learn, as it will help you in the next exercise set.
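As a quick illustration of the call signature, here is a minimal sketch (not the exercise solution) that scores a single K-Means fit on the 0/1 matrix built above; the choice of `n_clusters=5` and the fixed `random_state` are arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = df_transactions_p.values                      # customers as rows, offers as columns
kmeans = KMeans(n_clusters=5, random_state=42).fit(X)

# silhouette_score takes the data and the fitted cluster labels
print(silhouette_score(X, kmeans.labels_))
```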
Exercise: Using the documentation for the `silhouette_score` function above, construct a series of silhouette plots like the ones in the article linked above.
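A condensed sketch of one such plot, for a single arbitrarily chosen $K$ (here 5), might look like the following; `silhouette_samples` supplies the per-customer values the bars are built from, and the same code can be repeated for each $K$ of interest:

```python
import matplotlib.cm as cm
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

X = df_transactions_p.values
n_clusters = 5                                     # arbitrary example value
labels = KMeans(n_clusters=n_clusters, random_state=42).fit_predict(X)

sample_values = silhouette_samples(X, labels)      # per-customer silhouette values
avg_score = silhouette_score(X, labels)

fig, ax = plt.subplots(figsize=(8, 6))
y_lower = 10
for i in range(n_clusters):
    # Sort the silhouette values of the customers assigned to cluster i
    cluster_values = np.sort(sample_values[labels == i])
    y_upper = y_lower + cluster_values.shape[0]
    color = cm.nipy_spectral(float(i) / n_clusters)
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_values,
                     facecolor=color, edgecolor=color, alpha=0.7)
    ax.text(-0.05, y_lower + 0.5 * cluster_values.shape[0], str(i))
    y_lower = y_upper + 10                         # gap between the cluster bars

ax.axvline(x=avg_score, color="red", linestyle="--")  # average silhouette score
ax.set_xlabel("Silhouette coefficient")
ax.set_ylabel("Cluster label")
plt.show()
```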
Exercise: Compute the average silhouette score for each $K$ and plot it. What $K$ does the plot suggest we should choose? Does it differ from what we found using the Elbow method?
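One possible sketch for this (not the reference solution; the range of $K$ and the fixed `random_state` are arbitrary choices) computes the average silhouette score for each $K$ and plots the curve:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = df_transactions_p.values
ks = range(2, 11)
avg_scores = []
for k in ks:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    avg_scores.append(silhouette_score(X, labels))  # average over all customers

plt.plot(list(ks), avg_scores, marker="o")
plt.xlabel("Number of clusters $K$")
plt.ylabel("Average silhouette score")
plt.show()
```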
Exercise: Use PCA to plot your clusters:
Exercise: Now look at both the original raw data about the offers and transactions and at the fitted clusters. Tell a story about the clusters in context of the original data. For example, do the clusters correspond to wine varietals or something else interesting?
```python
# your turn
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project the 0/1 customer-offer matrix down to two dimensions
transactions = np.array(df_transactions_p)
pca = PCA(n_components=2)
transactions_pca = pca.fit_transform(transactions)

customer_name = df_transactions_p.index
x = transactions_pca[:, 0]
y = transactions_pca[:, 1]

# Fit K-Means on the 2-D projection for 1 to 10 clusters and collect the labels
frames = []
for n_clusters in range(1, 11):
    model = KMeans(n_clusters=n_clusters)
    model.fit(transactions_pca)
    frames.append(pd.DataFrame(data={'n_clusters': n_clusters,
                                     'customer_name': customer_name,
                                     'cluster_id': model.labels_,
                                     'x': x,
                                     'y': y}))
df = pd.concat(frames, ignore_index=True)
```

```python
g = sns.FacetGrid(df, col='n_clusters', col_wrap=5, hue='cluster_id')
g.map(sns.scatterplot, 'x', 'y')
plt.show()
```

The clustering looks better at the values suggested by the silhouette analysis (6 or 10). At n = 4 the clusters take on some very odd shapes, while at n = 6 they form a fairly even subdivision of the space. What we've done is take those columns of 0/1 indicator variables and transform them into a 2-D dataset. We took one column and arbitrarily called it `x`, and called the other `y`. Now we can throw each point into a scatterplot, color coding each point by its cluster so the groups are easier to see. As we saw earlier, PCA has a lot of other uses. Since we wanted to visualize our data in 2 dimensions, we restricted the number of dimensions to 2 in PCA. But what is the true optimal number of dimensions?
Exercise: Using a new PCA object shown in the next cell, plot the `explained_variance_` field and look for the elbow point, the point where the curve's rate of descent seems to slow sharply. This value is one possible value for the optimal number of dimensions. What is it?
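The cell with the new PCA object is not reproduced here, so the sketch below is one possible way to set it up (fitting a PCA with all components on the 0/1 matrix is an assumption, not the notebook's own cell); the elbow is then read off the curve:

```python
from sklearn.decomposition import PCA

# Keep all components so the full variance curve can be inspected
pca = PCA().fit(df_transactions_p.values)

plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, marker="o")
plt.xlabel("Number of components")
plt.ylabel("Explained variance")
plt.show()
```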
Method name | Parameters | Scalability | Use Case | Geometry (metric used) |
---|---|---|---|---|
K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non Euclidean distances | Any pairwise distance |
DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
Birch | branching factor, threshold, optional global clusterer. | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction. | Euclidean distance between points |
Exercise: Try clustering using the following algorithms.

1. Affinity propagation
2. Spectral clustering
3. Agglomerative clustering
4. DBSCAN

How do their results compare? Which performs the best? Tell a story why you think it performs the best.
```python
# Your turn
from sklearn.cluster import AffinityPropagation, SpectralClustering, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

models = [AffinityPropagation(), SpectralClustering(), AgglomerativeClustering(), DBSCAN(eps=1)]
sil_score = []
transactions = np.array(df_transactions_p)

for model in models:
    model.fit(transactions)
    labels = model.labels_
    print(np.unique(labels))                       # which cluster labels each algorithm produced
    sil_score.append(silhouette_score(transactions, labels))

print(sil_score)
```

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13]
[0 1 2 3 4 5 6 7]
[0 1]
[-1 0 1 2 3]
[0.12346523604478911, 0.0983411795884811, 0.08258017823184984, 0.012715203274911742]

Affinity propagation with default values fared the best: it has the highest silhouette score and found the most clusters (14) of any of the algorithms. This fits the table above, since the other algorithms claim to do better with large n_samples, whereas this is not an enormous number of points. Even so, all four scores fall below 0.25, which the silhouette interpretation table above classifies as "no substantial structure."