,

DATA SCIENCE SERIES

Supermarket sales | Business analysis

Univariate, Bi-variate, and Multivariate analysis using IBM SPSS Modeler

--

Howdy! 👋
I’m a UX designer, who got interested in Business analysis using data mining techniques. I hope you find this article insightful 😉

It was inspired by this amazing article by Irfan Fadhullah

Dataset 🛒

Find this data set on Kaggle 🤘

Dataset context

The growth of supermarkets in most populated cities are increasing and market competitions are also high. The dataset is one of the historical sales of supermarket company which has recorded in 3 different branches for 3 months data.

Attribute information

  1. Invoice id: Computer generated identification number
  2. Branch: Branch of supermarket (A, B, C)
  3. City: Location of supermarket (Yangon, Naypyitaw, Mandalay)
  4. Customer type: Type of customers (Normal, Member)
  5. Gender: Gender of the customers (Male, Female)
  6. Product line: Product categories (Food and beverages, Health and beauty, electronic accessories, Fashion accessories, Home and lifestyle, Sports and travel)
  7. Unit price: Price of products in US dollars
  8. Quantity: Number of products purchased by each customer
  9. Tax: 5% tax fee on top of purchase amount
  10. Total: Total price of the purchase including tax
  11. Date: Date of purchase (January 2019 to March 2019)
  12. Time: Time of purchase (10am to 9pm)
  13. Payment: Payment methods (E-wallet, Cash, Credit card)
  14. COGS: Cost of Goods Sold
  15. Gross margin percentage: Gross margin percentage of the purchase
  16. Rating: Customer stratification rating from each customer (From 1 to 10)

Business understanding 🕴

Client

Supermarket chain in Myanmar

Business goal

To improve the gross income of the supermarket C (Naypyitaw)

Business questions

  1. Segment the “high profit” and “low profit” customers to devise a customized marketing strategy for boosting the sales of supermarket C.
  2. How’s gross income affected by unit price, quantity, and other variables like day, timeslot, and product line in general?
  3. How’s the unit price affected by gross income, quantity, and other variables like day, timeslot, and product line?
  4. Find and explain the relationship between gender and product line, payment, gross income for Branch C?
  5. Find and explain the relationship between customer type and gender, day, timeslot for Branch C?
  6. Predict the day of future purchases of the customers.

Data mining goal

  1. Create a clustering model to segment the high profit and low profit customers of supermarket C.
  2. Create a model to analyze the relationship between different variables of the supermarket.
  3. Create a prediction model to predict the day of future purchase.

Data understanding 🤔

Field type analysis

It is evident that Invoice ID is detected as “typeless” since it has hyphens in between which could not be detected as a valid measurement type. Whereas, fields like branch, city, product line and payment are “nominal” that have multiple values. The customer type and gender are a binary field, and thus has “flag” measurement. And the rest of the variables are “continuous” since they have a range of numeric values.

Record count

Let’s go through briefly to analyze the number of records and individual field count to understand the overall state of the dataset. I’ve used aggregate and table nodes to find the specific record count.

This dataset contains 1000 records with individual branch distribution being 340 Branch A records, 328 Branch B records, and 332 Branch C records. The customer type and gender are equally distributed with 501 Member and 499 Normal customer records, 499 Male and 501 Female records. The product line has a wide distribution with 170 Electronic accessories, 178 Fashion accessories, 174 Food and beverages, 152 Health and beauty, 160 Home and lifestyle, and 166 Sports and travel records. The payment method is distributed into 3 categories with 345 e-wallet, 344 cash, and 311 credit card records.

Uni-variate analysis 🧐

3.3.1 Is there any difference in aggregate sales across all the branches?

A distribution graph node was used to find the frequency of the categorical field “Branch”. There is no much of a difference in sales across all the branches A,B, and C. But, the sales in Branch A is a bit higher than the rest of the branches. Here’s the numeric data of the branch sales,

a. Branch A — 340

b. Branch B — 332

c. Branch C — 328

3.3.2 Which is the most popular payment method that is used by the customers?

A distribution graph node was used to find the frequency of the categorical field “Payment”. The E-wallet payment method is the most popular one, but cash is also very popular. The least used payment method is the credit card. Here’s the numeric data of the payment method used,

a. Cash — 344

b. Credit card — 311

c. Ewallet — 345

3.3.3 Which is the most popular product line across all the branches?

A distribution graph node was used to find the frequency of the categorical field “Product line”. The most popular product line is the Fashion accessories, and the second most popular product line is the Food and beverages. Here’s the numeric data of different product lines (purchase counts),

a. Electronic accessories — 170 b. Fashion accessories — 178 c. Food and beverages — 174 d. Health and beauty — 152 e. Home and lifestyle — 160 f. Sports and travel — 166

3.3.4 Which gender makes the greater number of purchases?

A distribution graph node was used to find the frequency of the flag field “Gender”. It is evident that people from both the genders are making almost same number of purchases. But females tend to make more than male when we look more precisely. Here’s the numeric data of purchases made based on gender data,

a. Male — 499

b. Female — 501

Bi-variate analysis 🤓

3.4.1 Does gross income affect the ratings that the customers provide?

A scatter plot node was used to generate a scatter plot with a trend line between gross income and rating. Since the trend line is quite flat, we can conclude that gross income does not have a significant effect on customer stratification rating.

3.4.2 Which customer type is more profitable when customer stratification rating is considered?

The plot node was used to visualize the data in the form of stacked histogram. It is visually evident that customer who belong to Member category purchase more than the Normal customers. Thus, Members provide higher gross income compared to normal customers.

And, the customers who belong to customer stratification rating range of 6 to 8 purchase more than the other rating ranges.

Here’s the numeric data of gross income from both the customer types,

Member = $7820.164

Normal = $7559.205

Here’s the numeric data of gross income from customers who are rated between 6 to 8,

Gross income = $5564.311

3.4.3 Which is the most profitable branch?

The graphboard node was used to plot the graph between branch and gross income. It is visually evident that the Branch C is the most profitable branch among all the three. Whereas, Branch A and B are equal in profits. Here’s the numeric data of gross income from all the branches,

a. Branch A = $5,057.2

b. Branch B = $5,057.2

c. Branch C = $5,265.2

3.4.4 Which gender brings more profit to the supermarket?

The graphboard node was used to visualize this graph between gender and gross income. It is visually evident that Females majorly contribute to the supermarket’s profit. Here’s the numeric data for gender-wise gross income,

a. Male = $7,384.9

b. Female = $7,994.4

3.4.5 Which is the most profitable product line?

The graphboard node was used to plot the graph between gross income and product line. It is evident that the Food and beverages category is the most profitable one, and the Sports and travel is the second most profitable one. Here’s the numeric gross income data according to each product line,

a. Electronic accessories — $2,587.5 b. Fashion accessories — $2,586 c. Food and beverages — $2,673.6 d. Health and beauty — $2,342.6 e. Home and lifestyle — $2,564.9 f. Sports and travel — $2,624.9

Correlation analysis 📉

3.5.1 How does other fields affect the gross income of the supermarket?

Statistics and graphboard node were used to find the correlation between these fields. It is very evident the gross income is strongly correlated with unit price, quantity, and total bill amount. An interesting insight is the customer stratification rating is not correlated with gross income. So, we can conclude the customers are not rated based on the profits they contribute to the supermarket or any other fields.

Data preparation 👩‍🍳

Outlier detection

No outliers were detected while using the Data Audit node.

Data cleaning

4.2.1 Remove redundant field: Tax 5%, City

As shown in the table, both of these fields are redundant. So, I decided to filter out “Tax 5%” field while using the dataset.

Since city and branch gives us almost the same information, so I decided to filter out “city” field while using the dataset.

4.2.2 Fixed the date formatting

As shown in the table, the date is not formatted in a single date style (either the US or UK). And, the separator is also mixed (hyphen and slash). So, I used “Text to columns” option in Excel to fix the date formatting issue and the separator style.

4.2.3 Create new field: day

As shown in the table, a new field called ‘Day’ was created to have the day of the respective date for generating better insights from the dataset. The ‘TEXT’ function in excel was used to convert dates into days.

4.2.4 Create new field: timeslot

As shown in the table, a new field called “Timeslot” was created to have the timeslot (morning, afternoon, evening, and night) for generating better insights from the dataset. Using the ‘Derive’ node and set the multiple nominal rules to segment the data into 4 different categories.

4.2.5 Removal of unwanted field: Invoice ID, Gross margin percentage

Just to have the required fields for further data modelling, I decided to filter the Invoice ID field. Also, the Invoice ID has the measurement ‘Typeless’ which does not add any significant impact on further steps.

Since the gross margin percentage is the same ‘4.762%’ it does not have any impact for my business questions.

Additional analysis 📊

After I created new variables, I would like to analyze and know some answers for some of the questions. This can provide a different perspective towards the business goal.

5.1 Which day of the week is highly profitable in gross income?

The graphboard node was used to plot the bar chart of gross income vs day. It is very evident that the highest profitable day during the week is ‘Saturday’ followed by ‘Tuesday’. And, Sunday, Thursday, and Wednesday performs almost similarly. Here’s the numeric data of gross income per each day,

a. Monday = $1,804.7

b. Tuesday =$2,451.5

c. Wednesday =$2,082.4

d. Thursday = $2,159.5

e. Friday = $2,091.7

f. Saturday = $2,672.4

g. Sunday = $2,117.0

5.2 Which timeslot of the day is most profitable in gross income?

The graphboard node was used to plot the bar chart between gross income and timeslot. It is evident the profits are higher during the afternoons, followed by morning timeslot. Here’s the numeric data of gross income per each timeslot,

a. Morning = $2,942.8

b. Afternoon = $7,048.7

c. Evening = $2,4414.

d. Night = $2,946.4

5.3 Which day is highly profitable in gross income in the morning, afternoon, evening, and night?

The graphboard node was used to visualize the box plot of gross income vs day & time slot. Below is the detailed numeric data of gross income,

During the mornings,

  1. Sunday = $19.057
  2. Wednesday = $17.711

During the afternoon,

  1. Tuesday = $18.054

During the evening,

  1. Saturday = $18.744

During the night,

  1. Thursday = $18.358
  2. Friday = $18.283

Thus, supermarket can take advantage of this trend and provide the best experience to its customers. Here’s the numeric

Modelling 📳

Clustering

5.1.1 Segment the “high profit” and “low profit” customers to devise a customized marketing strategy for supermarket C along with additional characteristics.

Auto Cluster node was used to quickly find the best cluster having the best silhouette value using 2 different clustering methods (K-means and two-step).

Cluster analysis helps to segment different customers with their unique characteristics that can help the supermarket to devise a customized marketing strategy.

K-means & two step clustering

A select node was used to select the data only from branch C, and a partition node was used to split the data into 80–20 to avoid the over-fitting problems. Since the data has not labeled the customers into high profit and low profit, also the customer stratification rating is not directly related to the gross income, we have to cluster the dataset based on the following features,

  1. Customer type
  2. Product line
  3. Day
  4. Timeslot
  5. Payment
  6. Gross income

Considering the best silhouette value, I plotted the graph between number of clusters and their respective silhouette value to select the best model,

After analyzing the silhouette value and the cluster quality for both the clustering methods, I have decided to go for K-means clustering consisting 6 clusters with the second highest silhouette value of 0.338.

As we can see the cluster quality is almost fair with no potential overfitting. Now, let us see all the 6 clusters formed using the K-means algorithm.

Describing the high profit cluster,

  1. Size: It is the second largest cluster (18.5%) among all the 6 clusters
  2. Preferred payment: Ewallet by everyone (100%)
  3. Customer type: Normal (100%)
  4. Gross income: $17.56 (higher than the median of $12.925)
  5. Frequent day: Thursday (26.1%) followed by Tuesday and Saturday
  6. Frequent timeslot: Afternoon (60.9%)
  7. Frequently purchased product line: Fashion accessories (23.9%) followed by Food and beverages, Health and beauty

Inference: It is surprising that the normal customers are generating the highest profits (gross income) when compared to member customers. They buy mostly on Thursday afternoons using their E-wallet for payments. Also, they mostly purchase products from Fashion accessories.

Describing the low profit cluster,

  1. Size: It is the smallest cluster (11.6%) among all the 6 clusters
  2. Preferred payment: Credit card by everyone (100%)
  3. Customer type: Normal (100%)
  4. Gross income: $12.28 (lower than the median of $12.925)
  5. Frequent day: Wednesday (24.1%) followed by Tuesday, Thursday, and Saturday
  6. Frequent timeslot: Afternoon (44.8%)
  7. Frequently purchased product line: Fashion accessories (24.1%) followed by Health and beauty

Inference: It is evident that the normal customers belong to this cluster with the least gross income of $12.28. They mostly purchase during the Wednesday afternoons with their credit card to buy products from fashion accessories.

Linear regression

Linear regression was used to find linear relationships between different continuous variables. I tried to have multiple variations with different target variables and different splits, but no significant difference in the R2 score was found.

5.2.1 How’s gross income affected by unit price, quantity, and other variables like, day, timeslot, and product line.

Since the question also asks about the variables which are nominal, I had to convert them to flag based variables so that I could use them in the linear regression modelling. Here are the variables involved in the modeling,

Input variables,

  1. Unit price — Continuous
  2. Quantity — Continuous
  3. Day — Nominal → Flag
  4. Timeslot — Nominal → Flag
  5. Product line — Nominal → Flag

Target,

  1. Gross income — Continuous

The SetToFlag node was used to convert the nominal to flag,

It is evident that the R2 score remains the same even if the partition split changes,

We can conclude the R2 score being substantial (0.892), and the data fit the model quite well. Also, the statistical significance is 0.000 so the results are significant.

Here’s the findings from regression model,

It is very conclusive from the coefficients that quantity (2.796) being the most impactful variable on the gross income of the supermarket. Also, the unit price (0.2772) with an interesting insight of Tuesday’s sales (0.2211) with payment made via credit card (0.519) has also had a positive impact. Finally, the Home and lifestyle products (0.3098) contribute the most and positively to the supermarket’s gross income.

5.2.2 How’s rating affected by unit price, quantity, and gross income?

It was evident with the R2 score of 0.004 is the lowest and is bad for any conclusions to draw from the model.

5.2.3 How’s the unit price affected by gross income, quantity, and other variables like day, timeslot, and product line.

Since the question also asks about the variables which are nominal, I had to convert them to flag based variables so that I could use them in the linear regression modelling. Here are the variables involved in the modeling,

Input variables,

  1. Gross income — Continuous
  2. Quantity — Continuous
  3. Day — Nominal → Flag
  4. Timeslot — Nominal → Flag
  5. Product line — Nominal → Flag

Target,

  1. Unit price — Continuous

The SetToFlag node was used to convert the nominal to flag,

It is evident that the R2 score remains the same even if the partition split changes,

We can conclude the R2 score being substantial (0.785), and the data fits the model quite well. Also, the statistical significance is 0.000 so the results are significant.

Here’s the findings from regression model,

It is conclusive that the gross income (2.823) has the highest impact on the unit price. Also, Monday (2.372), Morning (1.767), Cash payment (0.1373) has a positive relationship with the unit price.

Logistic regression

Logistic regression was used to find the relationship between variables and the impact of one on the other variable. In general, I tried to answer the below mentioned in 4 different ways to find the best accuracy. Here are the four combinations of training & testing the dataset to achieve the best accuracy,

  1. Full dataset with 80–20 split
  2. Full dataset with 75–25 split
  3. Dataset of only Branch C with 80–20 split
  4. Dataset of only Branch C with 75–25 split

Here’s the Full dataset version,

Here’s the Branch C version,

5.3.1 Find and explain the relationship between gender and product line, payment, gross income for Branch C.

Let’s first see the accuracies of all the four versions,

a. Full dataset with 80–20 split — 52.4%

b. Full dataset with 75–25 split — 55.82%

c. Dataset of only Branch C with 80–20 split — 41.77%

d. Dataset of only Branch C with 75–25 split — 41.3%

As we know from the previous data, females contribute more towards supermarket’s gross income than males. So, I decided to set the base category for target as females, and get more insights on male.

Some of the insights from the regression equation are,

  1. (-0.1909) Male customers contribute the least to gross income while spending with cash
  2. (0.8547) Male customers also contribute more to gross income while spending on Home and lifestyle products
  3. (0.1328) Male customers mostly purchase Home and lifestyle products globally, and Fashion accessories paid via cash
  4. (-0.8603) Similarly, electronic accessories are least or never purchased by male customers via cash
  5. (-0.2663) Male customers do not purchase Home and lifestyle products via credit card

5.3.2 Find and explain the relationship between customer type and gender, day, timeslot for Branch C.

Let’s first see the accuracies of all the four versions,

a. Full dataset with 80–20 split — 46.77%

b. Full dataset with 75–25 split — 51.81%

c. Dataset of only Branch C with 80–20 split — 51.9%

d. Dataset of only Branch C with 75–25 split — 53.26%

Finding the potential normal customers to convert them to members could be beneficial to the supermarket, and thus I have set the members as base category.

Some of the insights from the regression equation are,

  1. (0.5286) Normal customers do their shopping on Monday are the highest
  2. (0.4463) They prefer mornings to do their shopping
  3. (0.7156) Normal customers who are female and shop during Thursday are the highest
  4. (1.673) Normal customers who are female and shop during Tuesday afternoon are the highest
  5. (-1.413) Normal customers who shop on Monday mornings are the least
  6. (-0.7133) Normal customers who are female and shop on Friday are the least

5.3.3 Find and explain the relationship product line between branch, customer type, gender, day for Branch C.

Let’s first see the accuracies of all the four versions,

a. Full dataset with 80–20 split — 17.91%

b. Full dataset with 75–25 split — 16.47%

c. Dataset of only Branch C with 80–20 split — 12.66%

d. Dataset of only Branch C with 75–25 split — 10.87%

Since the Health and beauty are the worst performing product line in terms of gross income from the previous data, I decided to focus on that.

Some of the insights from the regression equation are,

  1. (0.2857) Health and beauty product line perform better at Branch B
  2. (0.4387) Health and beauty product line perform better on Friday
  3. (0.7856) Health and beauty product line perform better on Sunday in Branch A
  4. (1.714) Health and beauty product line perform better on Friday in Branch B
  5. (1.191) Health and beauty product line perform better on Sunday’s with Member customers
  6. (-1.828) Health and beauty product line perform the worst on Sunday’s by Female

To summarize, the 75–25 split had better accuracy rates but with slightly different input datasets. But, each question is answered with higher accuracy with different data combinations.

Classification: Nearest Neighbor

Due to pre-classified data of product lines and fewer continuous inputs I wasn’t able to find any conclusions or insights. Also, this was not a classification problem. So, I found this modeling method not applicable to my dataset.

Prediction: Neural networks & Random forests

I chose two algorithms to predict the day of purchase while taking some of the features from the dataset. To get a different perspective, I converted the product line nominal variable to a continuous number-based variable. Here’s how I converted the nominal to continuous,

First, I selected the field I wanted to change and assigned new values to each option.

And, since these new values were in string format, I had to change them to number format.

Now, let’s explore the inputs and target used for training both the models,

Inputs,

  1. Quantity — Continuous
  2. Unit price — Continuous
  3. Total — Continuous
  4. Product line (R) — Continuous

Target,

  1. Day — Nominal

Let’s now see how both the models perform with the same inputs and target of the same dataset,

Neural networks

The model did not perform really well. It was not able to detect the days accurately when compared to the actual data. This can be concluded with a poor Correct % of just 13.41%.

Random Forest

The model performed really well being able to predict very the days very accurately when compared with the real data. The Correct % was 96.34%.

To conclude, Random Forest was the best model to predict the day of purchase in the given dataset.

Prediction: Decision trees (C 5.0, CHAID)

I chose two algorithms to predict the day of purchase while taking some of the features from the dataset. To get a different perspective, I converted the product line nominal variable to a continuous number-based variable. Here’s how I converted the nominal to continuous,

First, I selected the field I wanted to change and assigned new values to each option.

And, since these new values were in string format, I had to change them to number format.

Now, let’s explore the inputs and target used for training both the models,

Inputs,

  1. Quantity — Continuous
  2. Unit price — Continuous
  3. Total — Continuous
  4. Product line (R) — Continuous

Target,

  1. Day — Nominal

Let’s now see how both the models perform with the same inputs and target of the same dataset,

C 5.0 & CHAID node was used to predict the day of purchase

I’ve used both models in a direct fashion to get the individual prediction. Let’s compare the correct percentage from each model and different data split percentage,

C 5.0 (80–20 split) The correct percentage & equation,

CHAID (80–20 split) The correct percentage & equation,

C 5.0 (75–25 split) The correct percentage & equation,

CHAID (75–25 split) The correct percentage & equation,

Until now, it is conclusive that these models are not better than the random forest model. Let’s try a different combination with the introduction of logistic regression node with the same split percentages,

C 5.0 + CHAID + Logistic regression (80–20 split),

C 5.0 + CHAID + Logistic regression (75–25 split),

So, we can conclude again that random forest was the best model for predicting the day of purchase.

Evaluation 🙋‍♂️

Question 1: Segment the “high profit” and “low profit” customers to devise a customized marketing strategy for boosting the sales of supermarket C.

Results: The low-profit customer cluster prefers credit card payments, and everyone is normal customers. They mostly purchase on Wednesday afternoons and mostly buy Fashion accessories.

Business action: Provide cashback and discount offers for purchases made via credit cards, and promote the supermarket memberships to the normal customers.

Question 2: How’s gross income affected by unit price, quantity, and other variables like day, timeslot, and product line in general?

Results: It is very conclusive from the coefficients that quantity (2.796) being the most impactful variable on the gross income of the supermarket. Also, the unit price (0.2772) with an interesting insight of Tuesday’s sales (0.2211) with payment made via credit card (0.519) has also had a positive impact. Finally, the Home and lifestyle products (0.3098) contribute the most and positively to the supermarket’s gross income.

Business action: Promoting the sales of Home and lifestyle products through multiple communication channels, and exclusive Tuesday sales can help the supermarket improve its profits. Finally, selling high-priced products can really boost gross income.

Question 3: How’s the unit price affected by gross income, quantity, and other variables like, day, timeslot, and product line?

Results: It is conclusive that the gross income (2.823) has the highest impact on the unit price. Also, Monday (2.372), Morning (1.767), Cash payment (0.1373) has a positive relationship with the unit price.

Business action: Having cross-selling campaigns and exclusive combo offers for monday mornings on high priced products to boost the gross income of the supermarket.

Question 4: Find and explain the relationship between gender and product line, payment, gross income for Branch C.

**Results: (**0.1909) Male customers contribute the least to gross income while spending with cash. (0.8547) Male customers also contribute more to gross income while spending on Home and lifestyle products. (0.1328) Male customers mostly purchase Home and lifestyle products globally, and Fashion accessories paid via cash (0.8603). Similarly, electronic accessories are least or never purchased by male customers via cash (0.2663). Male customers do not purchase Home and lifestyle products via credit card.

Business action: Promoting the offers and discounts on male products, and bundling them with top-selling female products to increase the overall profits

Question 5 & 6: Find and explain the relationship between customer type and gender, day, timeslot for Branch C? Predict the day of future purchases of the customers.

**Results: (**0.5286) Normal customers who do their shopping on Monday are the highest. (0.4463) They prefer mornings to do their shopping (0.7156) Normal customers who are female and shop during Thursday are the highest. (1.673) Normal customers who are female and shop during Tuesday afternoon are the highest. (1.413) Normal customers who shop on Monday mornings are the least. (0.7133) Normal customers who are female and shop on Friday are the least.

Business action: Having exclusive member discounts on Monday, Thursday, and Tuesday can help increase the member subscription revenue. Plus, it will also increase product sales.

I have summarized my whole process, and I hope it helped you!

Buy me a beer 🍺

You can support me to write more of such articles here

You can clap 50 times 👏, did you know?

Click and hold the claps icon until you reach 50

Hope this helps you to gear-up! Feel free to reach out to me on,
👨‍💼 LinkedIn | 💌 Email | 🐤 Twitter

--

--

Somesh Kesarla Suresh

Product Designer who loves Problem-solving 🧩 Catch-up my Design articles on www.medium.com/@someshdesign and connect with me on www.linkedin.com/in/someshks