Monday, 16 December 2024

VPDA - Mall Customers Data Analysis

Introduction

Exploring a dataset of mall customers can uncover patterns in spending habits, help identify distinct customer segments, and guide data-driven marketing strategies. The Mall Customers dataset from Kaggle provides demographic details (Age, Gender), Annual Income, and a Spending Score for 200 individuals.

Data Overview
The Mall Customers dataset includes 200 records, each with a CustomerID, Gender, Age, Annual Income (in thousands of dollars), and a Spending Score from 1 to 100. There are no missing values. The mean Age is approximately 38.85, and the average Annual Income is near 60.56k per year. The Spending Score, which averages around 50.2, represents an internal metric assigned by the mall. Distributions of the features show that Ages are spread roughly between 20 and 70, Annual Incomes are concentrated between 30k and 80k with a few higher outliers, and Spending Scores cluster around the mid-range without a clear linear relationship to the other features.
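The summary figures above could be reproduced with a few lines of pandas. This is a minimal sketch, not the code used for the post: the column names are assumed to match the commonly distributed Kaggle CSV, and the five inline rows are made-up placeholders so the example runs without the file.

```python
import io

import pandas as pd

# Made-up placeholder rows (NOT real records). In practice, pass the path of
# the downloaded Kaggle CSV to pd.read_csv instead of this StringIO.
SAMPLE_CSV = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
"""

# Assumed column names for the three numeric features.
NUMERIC_COLUMNS = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]


def summarize(df: pd.DataFrame) -> dict:
    """Return the row count, total number of missing cells, and column means."""
    return {
        "rows": len(df),
        "missing": int(df.isna().sum().sum()),
        "means": df[NUMERIC_COLUMNS].mean().round(2).to_dict(),
    }


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = summarize(df)
print(summary)
```

Run against the full dataset, `summary["rows"]` should be 200 and `summary["missing"]` should be 0 if the description above holds.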

Pairwise plots of Age, Annual Income, and Spending Score reveal no straightforward correlations. Younger customers do not necessarily spend more, and higher incomes do not guarantee higher Spending Scores. 
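One way to back up the visual impression from the pairwise plots is a Pearson correlation matrix. The sketch below uses synthetic, independently drawn values as a stand-in for the three features, so the weak correlations it prints are true by construction and do not reproduce the post's result; the helper name `pearson_matrix` is hypothetical.

```python
import numpy as np


def pearson_matrix(features: np.ndarray) -> np.ndarray:
    """Pearson correlation between the columns of an (n_samples, n_features) array."""
    return np.corrcoef(features, rowvar=False)


# Synthetic stand-in for Age, Annual Income (k$), and Spending Score.
# The columns are drawn independently, so near-zero correlation is expected.
rng = np.random.default_rng(0)
demo = np.column_stack([
    rng.integers(18, 71, size=200),   # Age
    rng.integers(15, 138, size=200),  # Annual Income (k$)
    rng.integers(1, 101, size=200),   # Spending Score
]).astype(float)

corr = pearson_matrix(demo)
print(np.round(corr, 2))
```

Off-diagonal entries close to zero indicate the absence of a linear relationship only; nonlinear structure, such as the clusters discussed later, can still be present.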



Hierarchical Clustering
To discover natural groupings, hierarchical clustering was applied to the scaled Age, Annual Income, and Spending Score features. Choosing five clusters divided the 200 customers into groups of various sizes (66, 45, 39, 28, and 22 members). When plotting these clusters by Annual Income and Spending Score, visually distinct segments appear. Some groups trend toward moderate incomes and mid-level spending, while others represent higher-income customers with a wide range of spending patterns. These clusters, formed without predefined labels, highlight inherent segmentation in the customer base.
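The clustering step described above can be sketched with scikit-learn. The post does not state the linkage criterion, so Ward linkage is an assumption here, and the input is synthetic stand-in data, which means the cluster sizes will not match the ones reported.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler


def cluster_customers(features: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Standardize the features, then assign agglomerative cluster labels."""
    scaled = StandardScaler().fit_transform(features)
    # Ward linkage is an assumption; the post does not name the linkage used.
    model = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return model.fit_predict(scaled)


# Synthetic stand-in for Age, Annual Income (k$), and Spending Score.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.integers(18, 71, size=200),
    rng.integers(15, 138, size=200),
    rng.integers(1, 101, size=200),
]).astype(float)

labels = cluster_customers(features)
sizes = np.bincount(labels, minlength=5)
print(sorted(sizes.tolist(), reverse=True))
```

Scaling first matters because Annual Income and Spending Score span different numeric ranges than Age, and unscaled distances would be dominated by the widest feature.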



Dimensionality Reduction with PCA and t-SNE
Principal Component Analysis (PCA) provides a way to visualize multi-feature data on a two-dimensional plane by projecting it onto the directions of greatest variance. After applying PCA, the previously discovered clusters spread out across the first two principal components, suggesting that the chosen features capture meaningful differences in customer behavior.
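A minimal sketch of the PCA projection, assuming the same three scaled features were used as PCA input (the post does not say so explicitly). The data is synthetic and the helper name `project_2d` is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(features: np.ndarray):
    """Standardize the features and project them onto two principal components."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA(n_components=2)
    coords = pca.fit_transform(scaled)
    return coords, pca.explained_variance_ratio_


# Synthetic stand-in for Age, Annual Income (k$), and Spending Score.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.integers(18, 71, size=200),
    rng.integers(15, 138, size=200),
    rng.integers(1, 101, size=200),
]).astype(float)

coords, ratio = project_2d(features)
print("variance explained by each component:", np.round(ratio, 3))
```

The two columns of `coords` can be scattered and colored by cluster label; `ratio` shows how much of the total variance the 2D view retains.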


A further step, t-SNE (t-Distributed Stochastic Neighbor Embedding), offers a nonlinear dimensionality reduction that often reveals clearer separations. The t-SNE plots show well-defined, tight groupings of points. Each cluster occupies a distinct region, which is consistent with the hierarchical clustering having found natural, data-driven segments. For instance, one cluster is tightly grouped far from the others, suggesting a customer profile that differs markedly from the rest.
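The t-SNE embedding could be produced along these lines. The perplexity, initialization, and random seed are assumptions, since the post does not report them, and the input is again synthetic stand-in data.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Age, Annual Income (k$), and Spending Score.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.integers(18, 71, size=200),
    rng.integers(15, 138, size=200),
    rng.integers(1, 101, size=200),
]).astype(float)

scaled = StandardScaler().fit_transform(features)

# perplexity=30, init="pca", and random_state=0 are assumed settings.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(scaled)
print(embedding[:5])
```

Because t-SNE is stochastic and sensitive to perplexity, fixing `random_state` keeps the plot reproducible, and trying a few perplexity values helps check that the groupings are stable.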



Classification with a Decision Tree
While the dataset does not provide a direct classification target, the Spending Score can be used to define one. Labeling customers as “High Spenders” if their Spending Score is above 50 creates a binary classification problem. A decision tree was trained using Age, Annual Income, and Gender as inputs. The resulting classification report shows balanced performance across the two classes, with macro-averaged precision and recall of about 0.72. The confusion matrix indicates that both high and low spenders are identified reasonably well, though some misclassifications occur.
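The classification setup above can be sketched as follows. The train/test split, tree depth, and 0/1 encoding of Gender are assumptions, as the post does not state them. The synthetic scores are random, so the printed metrics will sit near chance and will not reproduce the 0.72 reported above.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; every column is drawn independently at random.
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 71, size=n)
income = rng.integers(15, 138, size=n)
is_male = rng.integers(0, 2, size=n)   # Gender as 0/1 (assumed encoding)
score = rng.integers(1, 101, size=n)

X = np.column_stack([age, income, is_male])
y = (score > 50).astype(int)           # 1 = "High Spender", 0 = low spender

# A 70/30 stratified split and max_depth=4 are assumed, not taken from the post.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)
pred = tree.predict(X_test)

cm = confusion_matrix(y_test, pred)
print(cm)
print(classification_report(y_test, pred, target_names=["Low", "High"], zero_division=0))
```

The "macro avg" row of the report is the unweighted mean of the per-class precision and recall, which is the figure quoted in the text.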

Examining feature importances shows that Age carries the most weight in predicting high-spending behavior (importance ~0.5378), followed by Annual Income (~0.4116), while Gender contributes minimally (~0.0505). These findings suggest that age and income brackets may offer a more reliable way to anticipate higher spending patterns than gender does.
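In scikit-learn, a fitted tree exposes these values through `feature_importances_`, which are impurity-based and sum to 1. The sketch below ranks them; the helper `ranked_importances` is hypothetical, and the synthetic label is driven by age alone purely so that the ranking has something to find.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["Age", "Annual Income (k$)", "Gender"]


def ranked_importances(tree: DecisionTreeClassifier, names: list) -> list:
    """Pair each feature name with its impurity-based importance, largest first."""
    pairs = zip(names, tree.feature_importances_.tolist())
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Synthetic stand-in data (not the Kaggle records).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 71, size=200),   # Age
    rng.integers(15, 138, size=200),  # Annual Income (k$)
    rng.integers(0, 2, size=200),     # Gender as 0/1 (assumed encoding)
])
# Artificial label that depends on age only, for illustration.
y = (X[:, 0] < 40).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
ranking = ranked_importances(tree, FEATURES)
for name, value in ranking:
    print(f"{name}: {value:.4f}")
```

Impurity-based importances describe how the tree used each feature on the training data, so they are best read as a ranking rather than as effect sizes.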






Insights and Applications
The combination of clustering, dimensionality reduction, and classification techniques provides a comprehensive overview of the customer landscape. Unsupervised methods expose distinct market segments, while dimensionality reduction confirms these clusters visually, making it easier to convey the findings. The decision tree model adds another layer of value by highlighting which attributes most strongly influence spending behavior.

From a practical standpoint, these insights enable more targeted marketing strategies. For example, if one cluster consists primarily of younger, moderate-income individuals with high Spending Scores, tailored loyalty programs or special promotions could resonate strongly with that segment. Similarly, identifying older customers who consistently appear in high-spending clusters may prompt personalized product suggestions or event invitations that match their interests and influence their future spending decisions.

Conclusion
The Mall Customers dataset, when explored with a combination of hierarchical clustering, PCA, t-SNE, and decision tree modeling, reveals nuanced patterns and valuable segments. These techniques highlight how different features interact to shape spending behavior, making it possible to identify unique customer groups and predict which individuals might respond best to certain marketing initiatives. The result is a data-driven approach to understanding mall clientele, ultimately guiding more informed and effective business decisions.
