The objective is to analyze the credit card usage data of a bank's customers to identify their different profiles.
Tools used include Principal Component Analysis, KMeans clustering and hierarchical clustering.
The resulting client profiles are intended to help the bank better target its clients with tailored offers and products.
1. About the data
2. Data cleaning
3. PCA reduction
4. Model: hierarchical clustering, using 4 components
5. Conclusions and recommendations
6. Closing comments
7. References
CC_DF = pd.read_csv('CC GENERAL.csv', sep = ',')
CC_DF.isnull().sum()
CUST_ID                             0
BALANCE                             0
BALANCE_FREQUENCY                   0
PURCHASES                           0
ONEOFF_PURCHASES                    0
INSTALLMENTS_PURCHASES              0
CASH_ADVANCE                        0
PURCHASES_FREQUENCY                 0
ONEOFF_PURCHASES_FREQUENCY          0
PURCHASES_INSTALLMENTS_FREQUENCY    0
CASH_ADVANCE_FREQUENCY              0
CASH_ADVANCE_TRX                    0
PURCHASES_TRX                       0
CREDIT_LIMIT                        1
PAYMENTS                            0
MINIMUM_PAYMENTS                  313
PRC_FULL_PAYMENT                    0
TENURE                              0
dtype: int64
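Only CREDIT_LIMIT (1 missing value) and MINIMUM_PAYMENTS (313) contain nulls. One way to handle them is median imputation, which is robust to the heavy right tails these columns have; this is a sketch on toy data, not necessarily the exact approach used in the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for CC_DF (hypothetical values)
df = pd.DataFrame({
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0],
    "MINIMUM_PAYMENTS": [100.0, 250.0, np.nan],
})

# Median imputation: robust to the extreme outliers these columns contain
for col in ["CREDIT_LIMIT", "MINIMUM_PAYMENTS"]:
    df[col] = df[col].fillna(df[col].median())

print(df.isnull().sum().sum())  # 0 remaining nulls
```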
Some initial considerations about the data can be drawn from its summary statistics:
CC_DF.describe()
| | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 |
| mean | 1564.474828 | 0.877271 | 1003.204834 | 592.437371 | 411.067645 | 978.871112 | 0.490351 | 0.202458 | 0.364437 | 0.135144 | 3.248827 | 14.709832 | 4494.282473 | 1733.143852 | 844.906767 | 0.153715 | 11.517318 |
| std | 2081.531879 | 0.236904 | 2136.634782 | 1659.887917 | 904.338115 | 2097.163877 | 0.401371 | 0.298336 | 0.397448 | 0.200121 | 6.824647 | 24.857649 | 3638.646702 | 2895.063757 | 2332.792322 | 0.292499 | 1.338331 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 0.019163 | 0.000000 | 6.000000 |
| 25% | 128.281915 | 0.888889 | 39.635000 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1600.000000 | 383.276166 | 170.857654 | 0.000000 | 12.000000 |
| 50% | 873.385231 | 1.000000 | 361.280000 | 38.000000 | 89.000000 | 0.000000 | 0.500000 | 0.083333 | 0.166667 | 0.000000 | 0.000000 | 7.000000 | 3000.000000 | 856.901546 | 312.343947 | 0.000000 | 12.000000 |
| 75% | 2054.140036 | 1.000000 | 1110.130000 | 577.405000 | 468.637500 | 1113.821139 | 0.916667 | 0.300000 | 0.750000 | 0.222222 | 4.000000 | 17.000000 | 6500.000000 | 1901.134317 | 788.713501 | 0.142857 | 12.000000 |
| max | 19043.138560 | 1.000000 | 49039.570000 | 40761.250000 | 22500.000000 | 47137.211760 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 123.000000 | 358.000000 | 30000.000000 | 50721.483360 | 76406.207520 | 1.000000 | 12.000000 |
Next, outliers need to be addressed.
The following table shows that the maximum values of several variables lie far from their means. For instance, the maximum of MINIMUM_PAYMENTS is 32.4 standard deviations above the mean, which clearly points to the presence of outliers in the data.
# Distance of each variable's maximum from its mean, in standard deviations
desc = CC_DF.describe()
CC_DF_Z = pd.DataFrame({
    'Variable': desc.columns,
    'Standard deviations from the mean of MAX':
        ((desc.loc['max'] - desc.loc['mean']) / desc.loc['std']).values
})
CC_DF_Z.sort_values(by = 'Standard deviations from the mean of MAX', ascending = False)
| | Variable | Standard deviations from the mean of MAX |
|---|---|---|
| 14 | MINIMUM_PAYMENTS | 32.390925 |
| 4 | INSTALLMENTS_PURCHASES | 24.425524 |
| 3 | ONEOFF_PURCHASES | 24.199714 |
| 2 | PURCHASES | 22.482254 |
| 5 | CASH_ADVANCE | 22.009887 |
| 10 | CASH_ADVANCE_TRX | 17.546868 |
| 13 | PAYMENTS | 16.921334 |
| 11 | PURCHASES_TRX | 13.810243 |
| 0 | BALANCE | 8.397019 |
| 12 | CREDIT_LIMIT | 7.009671 |
| 9 | CASH_ADVANCE_FREQUENCY | 6.820140 |
| 15 | PRC_FULL_PAYMENT | 2.893291 |
| 7 | ONEOFF_PURCHASES_FREQUENCY | 2.673302 |
| 8 | PURCHASES_INSTALLMENTS_FREQUENCY | 1.599110 |
| 6 | PURCHASES_FREQUENCY | 1.269772 |
| 1 | BALANCE_FREQUENCY | 0.518055 |
| 16 | TENURE | 0.360659 |
I considered two methods for standardization of the data and dealing with outliers: the Robust Scaler and a power transformation.
To get an idea of the results, let's take the variable MINIMUM_PAYMENTS as an example:
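The two candidate methods can be compared on synthetic, heavily right-skewed data. The names CC_RS_demo and CC_T_demo below are stand-ins for the Robust Scaler and power-transformed frames used later; this is a sketch, not the notebook's exact preprocessing:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(0)
# Heavily right-skewed toy column standing in for MINIMUM_PAYMENTS
X = pd.DataFrame({"MINIMUM_PAYMENTS": rng.lognormal(mean=5, sigma=1.5, size=1000)})

# Robust Scaler: centers on the median and scales by the IQR; the
# distribution keeps its shape, so the max stays far from the mean
CC_RS_demo = pd.DataFrame(RobustScaler().fit_transform(X), columns=X.columns)

# Power transformation (Yeo-Johnson): reshapes the distribution toward
# Gaussian, pulling extreme values in toward the center
CC_T_demo = pd.DataFrame(PowerTransformer().fit_transform(X), columns=X.columns)

def max_z(s):
    """Distance of the max from the mean, in standard deviations."""
    return (s.max() - s.mean()) / s.std()

print(max_z(CC_RS_demo["MINIMUM_PAYMENTS"]), max_z(CC_T_demo["MINIMUM_PAYMENTS"]))
```

The Robust Scaler shrinks the scale but not the distance of the maximum from the mean, while the power transformation reduces both, which is the pattern the table below shows.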
# CC_*_std and CC_*_z hold, for each version of the data, the standard
# deviation and the distance of the max from the mean (computed as above)
table_1_CC_DF = [CC_DF_std, CC_DF_z]
table_1_CC_RS = [CC_RS_std, CC_RS_z]
table_1_CC_T = [CC_T_std, CC_T_z]
table_1 = pd.DataFrame([table_1_CC_DF, table_1_CC_RS, table_1_CC_T], columns = ['Standard deviation', 'Distance of Max to mean (stdv)'])
table_1.insert(0,'Method', ['Original Data', 'Robust Scaler', 'Power Transformation'])
table_1
| | Method | Standard deviation | Distance of Max to mean (stdv) |
|---|---|---|---|
| 0 | Original Data | 2332.792322 | 32.390925 |
| 1 | Robust Scaler | 3.775626 | 32.390925 |
| 2 | Power Transformation | 1.000056 | 4.036337 |
The table above shows that, for the MINIMUM_PAYMENTS variable, the Robust Scaler reduces the standard deviation but leaves the maximum just as far from the mean (32.4 standard deviations), whereas the power transformation brings the standard deviation to 1 and the maximum to only 4.0 standard deviations from the mean.
The following graph shows the effect of the power transformation on the distribution of the variable MINIMUM_PAYMENTS:
# Graph to compare original data with transformed: one hist and one boxplot for each case
f, axs = plt.subplots(2,2, figsize = (10,6), sharex = False, gridspec_kw = dict(height_ratios = [1,3]))
sns.histplot(data = CC_DF, x = "MINIMUM_PAYMENTS", ax = axs[0,0], legend = False)
sns.boxplot(data = CC_DF, x = "MINIMUM_PAYMENTS", ax = axs[1,0])
sns.histplot(data = CC_T, x ="MINIMUM_PAYMENTS", ax = axs[0,1], legend = False)
sns.boxplot(data = CC_T, x = "MINIMUM_PAYMENTS", ax = axs[1,1])
f.tight_layout()
In the graph above, the left-hand side shows the variable's original distribution, which ranges from 0.02 to 76,406, with a median of 312.3. The right-hand side shows the distribution after the power transformation: the variable now ranges from -5.8 to 8.9, and its maximum value is 4.0 standard deviations away from its mean. The data is now more concentrated around its central values.
The following graph is the equivalent of the graph above, but for another variable: BALANCE. Similar conclusions can be drawn from it.
f, axs = plt.subplots(2,2, figsize = (10,6), sharex = False, gridspec_kw = dict(height_ratios = [1,3]))
# histogram along x-axis without legend
sns.histplot(data = CC_DF, x = "BALANCE", ax = axs[0,0],legend = False)
sns.boxplot(data = CC_DF, x = "BALANCE", ax = axs[1,0])
sns.histplot(data = CC_T, x ="BALANCE", ax = axs[0,1], legend = False)
sns.boxplot(data = CC_T, x = "BALANCE", ax = axs[1,1])
f.tight_layout()
As a final comment on the descriptive statistics of the data, the following correlation heatmap shows some expected correlations, such as those between the variables' amounts, their frequencies (_FREQUENCY) and their transaction counts (_TRX).
Interestingly, PRC_FULL_PAYMENT (how often the client pays the amount due in full) is negatively correlated with several of the other variables.
sns.heatmap(CC_T.corr(), cmap= 'YlGnBu')
Before the PCA reduction, a KMeans analysis on the transformed data was performed but, for brevity, it is omitted from this presentation.
An alternative to improve the results of the analysis is to apply dimensionality reduction with Principal Component Analysis (PCA).
I ran PCA, setting the number of components to 8.
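The PCA fitting step itself is not shown above. A minimal sketch of how PCA_1 and var_explained (used in the plots below) could be produced, using synthetic stand-in data since the real transformed frame is not reproduced here:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic stand-in for the power-transformed data (8950 x 17 in the real set)
CC_T_demo = rng.normal(size=(500, 17))

PCA_1 = PCA(n_components=8)
PCA_1.fit(CC_T_demo)

# Cumulative share of variance explained, plotted on the right-hand panel
var_explained = np.cumsum(PCA_1.explained_variance_ratio_)
print(var_explained.round(3))
```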
# Define optimal number of components
optimal_PC = 4
f, axs = plt.subplots(1,2, figsize = (8,4))
sns.lineplot(x = np.arange(1,9), y = PCA_1.explained_variance_ratio_, ax = axs[0])
sns.lineplot(x = np.arange(1,9), y = var_explained, ax = axs[1])
axs[0].set_title("Scree test")
axs[1].set_title("Accumulated variance explained")
axs[0].set_xlabel('Principal Components')
axs[1].set_xlabel('Principal Components')
axs[0].axvline(x = optimal_PC, color='r', linestyle=':')
axs[1].axhline(y = var_explained[(optimal_PC-1)], color='r', linestyle=':')
f.tight_layout()
For the analysis, I keep only the first 4 components, since the scree plot flattens after the fourth component and the accumulated variance explained is already substantial at that point.
The resulting loadings represent the relationship between the variables and the components, and are useful to give meaning to our 4 components.
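PCA_lv_1 is not constructed in the code shown; it presumably holds the loadings, i.e. the component vectors transposed so that rows are variables. A sketch on stand-in data (the VAR_* names are placeholders for the 17 real variables):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
cols = [f"VAR_{i}" for i in range(17)]  # stand-ins for the 17 variables
CC_T_demo = pd.DataFrame(rng.normal(size=(500, 17)), columns=cols)

PCA_1 = PCA(n_components=8).fit(CC_T_demo)

# Loadings: one row per original variable, one column per component
PCA_lv_1 = pd.DataFrame(
    PCA_1.components_.T,
    index=CC_T_demo.columns,
    columns=[f"V{i + 1}" for i in range(8)],
)

# Keep only loadings with absolute value above 0.28, as in the table below
top4 = PCA_lv_1.iloc[:, :4]
print(top4.where(top4.abs() > 0.28, "-"))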
PCA_lv_1.iloc[:, : optimal_PC][(PCA_lv_1 > 0.28) | (PCA_lv_1 < -0.28)].fillna('-')
| | V1 | V2 | V3 | V4 |
|---|---|---|---|---|
| BALANCE | - | 0.433761 | - | - |
| BALANCE_FREQUENCY | - | 0.30332 | - | -0.356114 |
| PURCHASES | 0.353849 | - | - | - |
| ONEOFF_PURCHASES | - | - | 0.50492 | - |
| INSTALLMENTS_PURCHASES | 0.317172 | - | -0.389956 | - |
| CASH_ADVANCE | -0.299421 | - | - | 0.292591 |
| PURCHASES_FREQUENCY | 0.355422 | - | - | - |
| ONEOFF_PURCHASES_FREQUENCY | - | - | 0.470402 | - |
| PURCHASES_INSTALLMENTS_FREQUENCY | 0.309445 | - | -0.447063 | - |
| CASH_ADVANCE_FREQUENCY | -0.291577 | - | - | 0.298498 |
| CASH_ADVANCE_TRX | -0.294386 | - | - | 0.292329 |
| PURCHASES_TRX | 0.364443 | - | - | - |
| CREDIT_LIMIT | - | - | - | - |
| PAYMENTS | - | 0.321255 | - | 0.31056 |
| MINIMUM_PAYMENTS | - | 0.37198 | - | -0.290518 |
| PRC_FULL_PAYMENT | - | - | - | 0.512515 |
| TENURE | - | - | - | -0.296175 |
PC1: Intensity of credit card use, especially for big amounts in installments, and not for cash advances.
- The total purchase amount in the last year is high, and so is the amount per transaction.
- The amount purchased in installments during the last year is also high.
- The frequency of purchases, overall and in installments, is high, meaning most months have at least one purchase.
PC2: Indebtedness. Purchases are not high, but the balance is.
- The monthly average balance is high: these clients have accumulated debt.
- The number of months with a balance is high, so they carry debt almost every month.
- The amounts due and paid to reduce the balance are high, and so are the minimum payments due: payments are being made, but a balance remains.
PC3: Buying one-off, not in installments.
- One-off purchase amounts are high, while installment purchase amounts are low.
- The card is not used for installments, only for one-off purchases.
- The same relation holds for the corresponding frequencies.
PC4: Paying the card in full, and using it especially for cash advances; relatively new clients.
- Paying the card in full every month loads highly on this component.
- Tenure is low: these are relatively new clients.
- The card is also used for cash advances, and not every month (the component is negatively related to balance frequency).
With the four components interpreted, the following graph suggests that some concentrations of data (the darker shadows) might become clusters in our clustering analysis.
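The frame df_pca used in the pairplot below presumably holds the component scores for each client; a sketch of how it could be built, again on stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
CC_T_demo = pd.DataFrame(rng.normal(size=(500, 17)))

# Project each client onto the principal components
pca = PCA(n_components=8)
scores = pca.fit_transform(CC_T_demo)

df_pca = pd.DataFrame(scores, columns=[f"PC{i + 1}" for i in range(8)])
print(df_pca[["PC1", "PC2", "PC3", "PC4"]].shape)
```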
sns.pairplot(df_pca[['PC1', 'PC2', 'PC3', 'PC4']], kind = "scatter", plot_kws = dict(alpha = 0.03), height = 1.4)
After analyzing different specifications for the model, the conclusion was to use hierarchical clustering with ward linkage. Complete, single and average linkage were also tried: average and single suffered from serious problems, while complete led to results similar to ward's.
DENDROGRAM: three clusters were selected, cutting the dendrogram at a distance of around 160.
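The helper plot_dendrogram is not defined in the code shown; it is presumably similar to the one in the scikit-learn documentation example for plotting an AgglomerativeClustering dendrogram, which converts the fitted model into a SciPy linkage matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering

def plot_dendrogram(model, **kwargs):
    # Count the samples under each node of the merge tree
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    # SciPy linkage matrix: [child_a, child_b, distance, sample count]
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix, **kwargs)
    return linkage_matrix

# Tiny demonstration on random data (the notebook passes C_model_5 instead)
rng = np.random.default_rng(0)
demo_model = AgglomerativeClustering(
    distance_threshold=0, n_clusters=None, linkage="ward"
).fit(rng.normal(size=(20, 4)))
linkage_matrix = plot_dendrogram(demo_model, no_plot=True)
```

Note that `model.distances_` is only populated when the model is fitted with `distance_threshold` set (or `compute_distances=True`), as is done in the cell below.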
CLUSTER_X = df_pca.iloc[:,:4]
C_model_5 = AgglomerativeClustering(distance_threshold = 0, n_clusters = None, linkage = 'ward')
C_model_5.fit_predict(CLUSTER_X)
plt.title("Hierarchical Clustering Dendrogram: Ward linkage")
plot_dendrogram(C_model_5, truncate_mode="level", p = 12)
plt.axhline(y = 160, color='r', linestyle=':')
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
I will use ward-linkage hierarchical clustering with three clusters.
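C_model_10 appears in the plotting code below without its definition; it is presumably the 3-cluster ward model, which could be fitted as follows (shown on stand-in data rather than the real component scores):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)
CLUSTER_X_demo = rng.normal(size=(300, 4))  # stand-in for the first 4 PCs

# Three clusters, ward linkage, as chosen from the dendrogram
C_model_10 = AgglomerativeClustering(n_clusters=3, linkage="ward")
C_model_10.fit(CLUSTER_X_demo)
print(np.unique(C_model_10.labels_))
```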
classes = ['Cluster 0', 'Cluster 1', 'Cluster 2']
fig, ax = plt.subplots(1, 3, figsize=(14, 6))
scatter_0 = ax[0].scatter(CLUSTER_X['PC1'], CLUSTER_X['PC2'], c = C_model_10.labels_, alpha = .8, cmap = 'RdYlGn')
ax[0].set_xlabel("Intensiveness, use installments", fontsize = 16)
ax[0].set_ylabel("Indebtedness", fontsize = 16)
ax[0].legend(handles = scatter_0.legend_elements()[0], labels = classes)
scatter_1 = ax[1].scatter(CLUSTER_X['PC1'], CLUSTER_X['PC3'], c = C_model_10.labels_, alpha = 0.8, cmap = 'RdYlGn')
ax[1].set_xlabel("Intensiveness, use installments", fontsize = 16)
ax[1].set_ylabel("Purchases one-off", fontsize = 16)
ax[1].legend(handles = scatter_1.legend_elements()[0], labels = classes)
scatter_2 = ax[2].scatter(CLUSTER_X['PC1'], CLUSTER_X['PC4'], c = C_model_10.labels_, alpha = 0.8, cmap = 'RdYlGn')
ax[2].set_xlabel("Intensiveness, use installments", fontsize = 16)
ax[2].set_ylabel("Pay in full, use cash in advance & new clients", fontsize = 16)
ax[2].legend(handles = scatter_2.legend_elements()[0], labels = classes)
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(1, 2, figsize=(10, 6))
scatter_0 = ax[0].scatter(CLUSTER_X['PC2'], CLUSTER_X['PC3'], c = C_model_10.labels_, alpha = 0.8, cmap = 'RdYlGn')
ax[0].set_xlabel("Indebtedness", fontsize = 16)
ax[0].set_ylabel("Purchases one-off", fontsize = 16)
ax[0].legend(handles = scatter_0.legend_elements()[0], labels = classes)
scatter_1 = ax[1].scatter(CLUSTER_X['PC2'], CLUSTER_X['PC4'], c = C_model_10.labels_, alpha = 0.8, cmap = 'RdYlGn')
ax[1].set_xlabel("Indebtedness", fontsize = 16)
ax[1].set_ylabel("Pay in full, use cash in advance & new clients", fontsize = 16)
ax[1].legend(handles = scatter_1.legend_elements()[0], labels = classes)
plt.tight_layout()
plt.show()
From this second set of graphs, which shows components 2, 3 and 4, the conclusions are a bit less clear, because clusters 0 and 1 (blue and green) overlap. Some patterns are still visible, though.
From the information in the graphs, and recalling our definitions of the components, we can sketch a client profile for every cluster:
Cluster 0
Cluster 1
Cluster 2
Having described the clusters to the extent that the graphs allow, I go back to the original data set to corroborate or correct the insights about the clients in each cluster.
CC_DF['CLUSTER'] = C_model_10.labels_
# selected_vars = ['BALANCE', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT', 'TENURE', 'CLUSTER']
# mean_complete = CC_DF[selected_vars].mean()
mean_complete = CC_DF.mean()
mean_complete = pd.DataFrame(mean_complete)
mean_complete.drop('CLUSTER', axis = 0, inplace = True)
# mean_grouped = CC_DF[selected_vars].groupby('CLUSTER').mean()
mean_grouped = CC_DF.groupby('CLUSTER').mean()
final_table = pd.concat([mean_complete.T, mean_grouped], ignore_index = True, axis = 0)
final_table.rename(index = {0 : 'Complete data', 1 : 'Cluster 0', 2 : 'Cluster 1', 3 : 'Cluster 2'}, inplace = True)
final_table
| | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data | 1564.474828 | 0.877271 | 1003.204834 | 592.437371 | 411.067645 | 978.871112 | 0.490351 | 0.202458 | 0.364437 | 0.135144 | 3.248827 | 14.709832 | 4494.282473 | 1733.143852 | 844.906767 | 0.153715 | 11.517318 |
| Cluster 0 | 933.706163 | 0.849932 | 1249.809328 | 695.620828 | 554.629137 | 184.732268 | 0.666572 | 0.234626 | 0.517996 | 0.025218 | 0.588490 | 18.815924 | 4304.695666 | 1388.925592 | 671.797045 | 0.228381 | 11.605834 |
| Cluster 1 | 2131.115470 | 0.882550 | 0.883766 | 0.753361 | 0.172329 | 1974.015044 | 0.001715 | 0.001300 | 0.000496 | 0.273109 | 6.288332 | 0.020251 | 3982.240437 | 1634.708858 | 992.030606 | 0.045101 | 11.305689 |
| Cluster 2 | 2688.396865 | 0.948175 | 1462.440511 | 982.892297 | 479.750150 | 2069.621381 | 0.556545 | 0.343402 | 0.350929 | 0.285881 | 7.241398 | 20.054939 | 5617.444870 | 2815.673059 | 1163.011032 | 0.068479 | 11.511654 |
Comparing the conclusions drawn from the graphs with the numbers in the table, it can be stated that most of them hold.
There are, however, a couple of observations:
Cluster 1. Most of its data points are concentrated around the middle values of one-off purchases (component 3), which would suggest that these clients buy in one-off purchases; the table, however, does not corroborate this. For other variables related to component 3, the table is more in line with expectations: clients in this cluster do not use their credit cards for purchases in installments, and they also have a low frequency of purchases, a variable whose loading on component 3 is negative (-0.2).
About cluster 2. After reviewing the table, we can confirm that these clients fall within the middle-upper part of component 4 (pay in full, use cash advances, new clients), because they use their credit cards intensively for cash advances and have a relatively low frequency of use.
Examples of tailored strategies
Some examples of the kind of targeted promotions could be:
For clients in cluster 0: frequently offer them the option to pay their purchases in interest-free installments when the amount of the purchase exceeds a certain threshold.
For clients in cluster 1: motivate them to use their credit card more for purchases by offering them a lower rate or lower fees on their cash advance operations once they reach a certain amount of monthly purchases.
For clients in cluster 2: the bank could incentivize these customers to use their cards more intensively, either by offering them lower fees on their cash advance operations, or by increasing their credit limit.
The original data set can be downloaded from: https://www.kaggle.com/arjunbhasin2013/ccdata
The following material online was consulted:
About standardization with outliers and power transformation
Dendrograms:
About PCA
Some materials from labs and homeworks from class 95791 Data Mining (Heinz College, Carnegie Mellon University), taught by Dr. Gabriela Gongora Swartzman were also used.