In today's dynamic retail environment, customers are offered a tremendous range of choices, and their loyalty is becoming increasingly transitory because competitors' actions have a severe impact on existing relationships (Reinartz and Kumar, 2000). This increased competition to satisfy diverse customer needs is shifting retailers away from their traditional focus on production and selling and towards customer relationships.
In the context of retail supermarkets, this has resulted in large investments in retail information systems that collect shoppers' data in order to understand customer shopping behaviour (Brijs et al., 2001). Several tools and technologies, including data warehousing, data mining and other customer relationship management (CRM) techniques, are used to manage and analyse this data. Data mining in particular, which simply means extracting knowledge from large amounts of data, helps organisations to find patterns and trends in their customers' data and then to drive improved customer relationships (Rygielski, Wang and Yen, 2002).
According to Witten & Frank (2005), data mining techniques such as decision trees (DT), artificial neural networks (ANN), genetic algorithms (GA) and association rules (AR) are commonly used to solve customer-related problems in fields such as engineering, science, finance and business. In the retail supermarket domain, data mining can be applied to identify useful customer behaviour patterns from large amounts of customer and transaction data (Giudici & Passerone, 2002). The discovered information can then be used to support better decision-making in retail marketing. Data mining techniques have mostly been adopted to make predictions and to describe behaviours.
During the past decade, there have been significant developments in data mining techniques. Some of these developments have been applied to customised service (Chen et al., 2005), which is vital for developing customer relationships in retail markets. Therefore, this research focuses on providing customised service to distinct customer segments in retail supermarkets by implementing data mining techniques with the help of data mining tools.
Researchers have proposed various approaches to mining the sales transaction data of a retail supermarket in order to improve customer relationships. Previously, customer behavioural variables such as the Recency-Frequency-Monetary (RFM) variables were combined with demographic variables to predict customer purchase behaviour (Chen et al., 2005). More recent research has advanced significantly, as Business Intelligence tools and advanced data mining algorithms are used to analyse the data in a more refined way.
Liao et al. (2008) proposed a methodology based on the Apriori and k-means algorithms to mine customer knowledge from household customers for product and brand extension in retailing. Bottcher et al. (2009) presented an approach for mining changing customer segments in a dynamic market: frequent itemsets are derived as representations of customer segments at different points in time and are then analysed for changes.
Effective management of sales transaction data is as important as the management of any other asset of a retail supermarket store. Sales transaction data usually contains a great amount of information distributed across numerous transactions.
This study focuses on applying data mining techniques to analyse the sales transaction data of a retail supermarket store and makes recommendations for providing customised service to defined customer segments. The research specifically uses two data mining techniques, namely clustering and association rule discovery. It starts by identifying different customer segments based on their purchase frequencies, in order to find the differences in their purchase behaviour. In the retail supermarket domain, 'behaviour' can have different meanings. For example, retailers often distinguish between light, medium and heavy users, or between weekday and weekend customers (Brijs et al., 2001). In this research, the differences are discovered by identifying the frequently purchased items for each customer segment and comparing their combinations. The retailer may use this information to customise the offer for those segments and to further examine the underlying relationships between those items for the purposes of pricing, product placement or promotions.
The aim of this research is to provide customised service to defined customer segments in a retail supermarket, by implementing data mining techniques on sales transaction data with the help of data mining tools.
This research follows a quantitative methodology: a dataset is obtained and analysed with data mining tools. The dataset was obtained from the ABC retail supermarket store, Canada, and was available online (https://www.statsci.org/datasets.html). The data required for this project is selected and loaded into SPSS (Statistical Package for the Social Sciences) and Weka, the data mining tools selected for this research. The algorithms selected for this study are the k-means algorithm for clustering and the Apriori algorithm for association rule mining; the reasons for choosing them are given in the literature review. These algorithms are applied to the dataset with SPSS and Weka. The results obtained from these algorithms need to be supported with charts, tables and graphs, which are produced in Microsoft Excel. Finally, recommendations are made based on the analysis of the results.
This chapter presents the essence of this dissertation, highlighting the aim and objectives of the research. The rest of the dissertation is structured as follows:
Chapter 2 provides a comprehensive literature review of different aspects relating to the research topic under study.
Chapter 3 discusses in detail the research methods and data analysis techniques followed in order to achieve the aim of this research.
Chapter 4 presents the analysis of the results obtained from the application of data mining algorithms on the data and provides recommendations.
Chapter 5 summarises the entire project, discusses the limitations of this research and points out areas for future research.
This chapter provides a critical review of the literature on the application of data mining in retail supermarkets. It begins with an introduction to data mining, followed by its evolution and applications in today's business world. It then explores the role of data mining in improving customer relationships in retail supermarkets, followed by a discussion of the typical data mining approach. It also discusses the techniques and algorithms applied in this project and the reasons for their choice.
The word 'mining' means extracting something useful or valuable, such as mining gold from the earth (Lappas, 2007). The importance of mining is growing continuously, especially in the business world. Data mining is the process of finding interesting patterns in databases for decision-making. It is one of the fastest-growing and most prominent fields, and it can provide a significant advantage to an organisation by exploiting vast databases (Rygielski, Wang and Yen, 2002). Finding patterns in business data is not new; traditionally, business analysts used a statistical approach. The computer revolution and huge databases ranging from a few gigabytes to terabytes have changed this scenario. For example, companies like Wal-Mart store huge amounts of sales transaction data, which can be used to analyse customer buying patterns and make predictions (Bose and Mahapatra, 2001). Data warehousing technology has enabled companies to store huge amounts of data from multiple sources under a unified schema.
Data mining has been considered a business intelligence tool for knowledge discovery (Wang & Wang, 2008). Many people treat data mining as a synonym for Knowledge Discovery from Data (KDD), but it is actually one part of that larger process, which describes the steps that must be taken to secure the desired results (Han and Kamber, 2006). A typical data mining process involves several iterative steps: the first is the selection of appropriate data from a single database or from multiple source systems, followed by cleaning and preprocessing for consistency. The data is then analysed to find patterns and correlations. This approach complements other data analysis techniques such as statistics and OLAP (on-line analytical processing) (Bose and Mahapatra, 2001). Every organisation follows a different data mining and modelling process to achieve its business imperatives.
It all started with the need to store data in computers and to improve access to it for decision-making. Today, technology enables users to access and navigate real-time data.
At the beginning of the 1960s, data was collected in order to make simple calculations that answered business questions such as the total average revenue for a specific period of time. In the 1980s and 1990s, data warehouses emerged as a way to store data in a structured format, and policies on the format of data to be used within an organisation were implemented (Therling, K., 1998). Data warehouses were then extended to be multi-dimensional, which allows stakeholders to drill down and navigate through the data.
Nowadays, online analytical tools help to retrieve data in real time, and computers can query data from the past up to the present. In recent years, technologies such as statistics, artificial intelligence (AI) and machine learning have been evolving as core areas of the data mining field (Rygielski, Wang and Yen, 2002). Combined with relational database systems and data integration, these technologies yield potentially useful knowledge from the data.
Data mining can be applied in many fields, depending on the aim of the company. Some of the main areas in today's business world where data mining is applied are as follows (Apte et al., 2002):
Swift (2001) defined CRM as an "Enterprise approach to understanding and influencing customer behaviour through meaningful communications in order to improve customer acquisition, customer retention, customer loyalty, and customer profitability". According to research by the American Management Association, "It costs three to five times as much to acquire a new customer than to retain the existing one", and this is especially evident in the services sector (Ennew & Binks, 1996). Therefore, it is very important to build good relationships with existing and new customers rather than merely expanding the customer base.
A large number of companies are adopting various tools and strategies to make CRM more effective and to gain an in-depth understanding of their customers. Data mining is a powerful technique that helps companies to mine the patterns, trends and correlations in their large amounts of customer or product data in order to drive improved customer relationships. It is one of the well-known tools applied to customer relationship management (CRM) (Giudici & Passerone, 2002). In the context of retail supermarkets, these patterns not only assist retailers in offering high-quality products and services to their customers, but also help them to understand changes in customer needs.
Improving customer relationships in retail supermarkets through data mining is a wide area of research interest. Depending on the retailer's objective, there are various areas in which data mining can be applied to enhance customer relationship management. Some of the major data mining applications in retail supermarkets identified from the literature are as follows:
Ivancsy & Vajk (2006) defined three main stages of the data mining process: (i) preprocessing, (ii) pattern discovery and (iii) pattern analysis/interpretation.
Famili (1997) defined data preprocessing as "all the actions taken before the actual data analysis process starts. It is essentially a transformation $T$ that transforms the raw real world data vectors $X_{ik}$, to a set of new data vectors $Y_{ij}$".
$Y_{ij} = T(X_{ik})$
Such that:
In the above relation:
i=1... n where n = number of objects,
j=1... m where m = number of features after preprocessing,
k=1. . . l where l = number of attributes/features before preprocessing, and in general, m ? l.
The most common data used for mining the purchase behaviour in retail supermarket is customer and transaction data (Giudici and Passerone, 2002).
With a huge collection of customers' sales transaction data available in databases, it is necessary to preprocess the data and extract the useful information from it. In the context of retail supermarkets, Pinto et al. (2006) suggested four key tasks in data preprocessing: data selection, data cleaning, data transformation and data understanding.
The first preprocessing task is data selection, in which the subset of the data on which pattern discovery is to be performed is identified. This task is especially helpful in dealing with large amounts of data, because it evaluates and categorises the data into much smaller datasets. The computational requirements for data analysis and manipulation are also greatly reduced by preprocessing large datasets through data selection techniques such as clustering or vector quantization (Famili, 1997).
The second task is data cleaning, whose basic operations include removing noise and handling missing data (Fayyad et al., 1996). Other data quality issues that may complicate data analysis, such as errors and insufficient attributes, are also addressed during data cleaning. In most cases, missing attribute values are replaced by the attribute mean; traditionally, however, if more than 20% of the attribute values are missing, the entire record is eliminated (Famili, 1997). To handle outliers and noisy data, techniques such as binning (partitioning the sorted attribute values into bins), clustering and regression are applied.
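The cleaning rules described above can be illustrated with a minimal sketch. It assumes that records are plain Python lists with `None` marking a missing value, and that binning is done with equal-frequency bins smoothed by their means; the function names `clean_records` and `smooth_by_bin_means` are hypothetical and are not taken from the cited sources or from the tools used in this research.

```python
from statistics import mean


def clean_records(records, max_missing_fraction=0.20):
    """Drop records with too many missing values, then mean-impute the rest.

    `records` is a list of equal-length lists; None marks a missing value.
    """
    # Keep only records whose share of missing attributes is within the limit.
    kept = [r for r in records
            if sum(v is None for v in r) / len(r) <= max_missing_fraction]
    if not kept:
        return []
    # Attribute means are computed from the observed (non-missing) values only.
    col_means = []
    for j in range(len(kept[0])):
        observed = [r[j] for r in kept if r[j] is not None]
        col_means.append(mean(observed) if observed else None)
    # Replace each remaining missing value by the mean of its attribute.
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in kept]


def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into bins, and replace each by its bin mean."""
    if not values:
        return []
    ordered = sorted(values)
    size = -(-len(ordered) // n_bins)  # ceiling division: values per bin
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed
```

In this sketch the 20% limit is applied per record, which is one reading of the rule quoted above; a per-attribute limit would need a different filter.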
The next preprocessing task is data transformation. "The application of each data mining algorithm requires the presence of data in a mathematically feasible format" (Crone et al., 2006). Inaccuracies in the measurement of inputs or incorrect feeding of data to the data mining algorithm could cause various problems. Therefore, operations such as normalization, aggregation, generalization and attribute construction are performed. Normalization scales attribute values into a specific range, whereas aggregation and generalization refer to summarising the data in terms of numeric and nominal attributes. Attribute construction handles the replacement of attributes or the addition of new attributes based on the existing ones (Markov and Larose, 2007).
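As an illustration of normalization, the sketch below uses min-max scaling, which is one common way of mapping attribute values into a chosen range. The choice of min-max scaling and the function name `min_max_normalize` are assumptions made here for illustration; the text above does not state which normalization method the cited authors have in mind.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale `values` so that they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]
```

For example, purchase counts of 10, 20 and 30 are mapped to 0.0, 0.5 and 1.0, so attributes measured on different scales become comparable before clustering.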
Once the issues with the data are resolved and the data is prepared, understanding the nature of the data is useful in many ways. According to Famili (1997), most data analysis tools have some limitations regarding data characteristics; it is therefore important to recognise these characteristics in order to set up the data analysis process appropriately. He further pointed out that techniques such as visualisation and principal component analysis are useful for understanding the data better.
Fayyad et al. (1996) stated that the core of the process is the application of specific data mining methods for pattern discovery and extraction. Pattern discovery is the key stage of the process in this research; it is where the data is mined. Once the data has been preprocessed and the irrelevant information removed, data mining techniques are applied to it to discover patterns. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields.
This research focuses on two main data mining methods that help to mine the data and find patterns: clustering and association. The reasons for choosing these methods are given below.
Clustering can be defined as a technique for grouping together a set of items that have similar characteristics (Kuo et al., 2002). In the retail domain, cluster analysis is a common tool for segmenting customers on the basis of their similarity on a chosen segmentation base or set of bases (Stewart, D.W. and Girish, P., 1983). The choice of one base or a combination of bases largely depends on the business question under study (Wind, 1978).
Segmentation can be based on various variables (bases), which Wedel and Kamakura (2000) classify as (1) general or product-specific and (2) observable or non-observable.
General bases for segmentation are independent of products, services or circumstances, whereas product-specific bases for segmentation are related to the product, the customer or the circumstances. Observable segmentation bases can be measured directly, whereas non-observable bases must be inferred. The combined classification of segmentation bases is shown below.
Twedt (1967), as cited in Engel et al. (1972), stated that the existence of huge amounts of transaction data in the retail supermarket domain provides a great impetus for segmentation on the basis of purchase frequencies. Such segmentation divides customers into groups according to the intensity with which they buy a product or products, such as light, medium and heavy buyers. According to Brijs (2002), if customers are classified by their purchase frequency, these segments can then be treated differently in terms of marketing communication (pricing, promotion, product recommendation, etc.) to achieve a greater return on investment (ROI) and customer satisfaction. Therefore, in this research, clustering is employed to segment customers into clusters on the basis of their similarity in purchase frequency.
Several algorithms have been proposed in the literature for clustering, such as ISODATA, CLARA, CLARANS, ScaleKM, P-CLUSTER, DBSCAN, Ejcluster, BIRCH and GRIDCLUS (Kanungo et al., 2002). It is not the objective of this research to use all of these algorithms. As mentioned earlier, the k-means clustering algorithm is used, and its justification is given below.
K-means has been considered one of the most effective algorithms for producing good clustering results in many practical applications (Alsabti et al., 1998). The main reason is that, when clustering is done for the purpose of data reduction, the goal is not to find the best partitioning but simply a reasonable consolidation of N data points into k clusters and, if necessary, some efficient way to improve the quality of the initial partitioning (Faber, 1994). The k-means algorithm therefore proves to be very effective in data reduction and produces good clustering output.
The k-means algorithm groups similar data into clusters, named Cluster 0, Cluster 1 and so on (Kanungo et al., 2002). Given a set of n data points in real d-dimensional space ($\mathbb{R}^d$) and an integer k, the aim is to determine k points in $\mathbb{R}^d$, called centres, so as to minimise the mean squared distance from each data point to its nearest centre. This measure is often called the squared-error distortion (Jain & Dubes, 1988).
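Written out, and assuming the notation $x_1,\dots,x_n$ for the data points and $c_1,\dots,c_k$ for the centres (these symbols are introduced here for illustration and do not come from the cited sources), the quantity being minimised can be sketched as:

```latex
\[
D(c_1,\dots,c_k) \;=\; \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,\lVert x_i - c_j\rVert^{2},
\qquad x_i,\,c_j \in \mathbb{R}^{d}.
\]
```

Each data point contributes the squared Euclidean distance to whichever centre is closest to it, and moving a centre to the mean of the points assigned to it never increases $D$.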
The diagram below illustrates the standard k-means algorithm. It shows the results of two iterations in the partitioning of nine two-dimensional data points into two well-separated clusters. Points in cluster 1 are shown in red and points in cluster 2 in black; data points are denoted by open circles and reference points by filled circles. Clusters are indicated by dashed lines. The iteration converges quickly to the correct clustering, even though the initial choice of reference points was bad.
Lloyd's algorithm is another popular version of k-means clustering; a single pass through all the data points (a single iteration) requires about the same amount of computation as in the standard k-means algorithm (Faber, 1994). Lloyd's algorithm is similar to the standard k-means algorithm, except that when the cluster centroids are chosen as reference points in the subsequent partition, the centroids are adjusted both during and after each partition. However, the k-means algorithm constantly updates the clusters and requires comparatively fewer iterations than Lloyd's algorithm; thus, the k-means algorithm is considerably faster. This is the key reason for selecting the k-means algorithm, since it can group customers with similar purchase frequencies into clusters in fewer iterations. However, Faber (1994) pointed out two major drawbacks of this algorithm. Firstly, it is computationally inefficient for large datasets. Secondly, although the algorithm always produces the desired number of clusters, the centroids of these clusters may not be particularly representative of the data.
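To make the iteration concrete, the following is a minimal sketch of a batch k-means pass on one-dimensional data such as purchase frequencies: every value is assigned to its nearest reference point, and each reference point is then moved to the mean of its members. The function name `k_means_1d`, the caller-supplied initial centres and the stopping rule are assumptions made for illustration; this is not the procedure implemented in SPSS.

```python
def k_means_1d(values, centers, max_iter=100):
    """Batch k-means on one-dimensional data.

    `centers` holds the initial reference points. Returns the final
    centers and the cluster index assigned to each value.
    """
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each value to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: (v - centers[j]) ** 2)
                  for v in values]
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break  # no center moved, so the assignment is stable
        centers = new_centers
    return centers, labels
```

With purchase frequencies such as `[1, 2, 3, 20, 22, 24, 60, 62]` and three initial centres, the three resulting clusters can be read as light, medium and heavy buyers. The outcome depends on the initial centres, which reflects the second drawback noted above.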
Association rule discovery was proposed to find all rules in basket data in order to analyse how the items purchased by customers in a shop are related (Gery & Haddad, 2003). A rule refers to attribute-value associations that occur frequently together within a given dataset (Han & Kamber, 2001). It is typically used in market basket analysis to discover rules of the form 'x% of customers who buy items A and B also buy item C' (Zaiane, 2001), which is an implication of the form (A, B) → C.
Some key definitions from the literature that characterise the association rule technique are provided below (Agrawal, Imielinski and Swami, 1993).
Itemset (i) - A set of items contained in a single transaction (e.g. {milk, sugar, curd})
Support (s) - The support expresses the percentage of transactions in the data that contain both the items in the antecedent and the consequent of the rule.
Confidence (c) - Confidence estimates the conditional probability of B given A, i.e. P(B|A), and is calculated as Confidence (c) = s(A & B) / s(A).
Association rule discovery typically involves a two-phase sequential methodology (Brijs, 2002).
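The definitions of support and confidence can be checked on a toy example. The sketch below assumes that each transaction is a Python set of item names; the function names `support` and `confidence` and the sample baskets are illustrative and are not taken from the dataset used in this research.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as s(A & B) / s(A)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0


baskets = [
    {"milk", "sugar", "curd"},
    {"milk", "sugar"},
    {"milk"},
    {"sugar", "curd"},
]
milk_sugar_support = support(baskets, {"milk", "sugar"})
milk_to_sugar_conf = confidence(baskets, {"milk"}, {"sugar"})
```

Here milk and sugar appear together in two of the four baskets, and milk appears in three, so the rule milk → sugar has a support of 0.5 and a confidence of two thirds.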
The first phase involves looking for so-called frequent itemsets, i.e. itemsets for which the support in the database equals or exceeds the minimum support threshold set by the user. This is computationally the most complex phase because of the number of possible combinations of items that need to be tested for their support.
Once all frequent itemsets are known, the discovery of association rules is comparatively straightforward. The general scheme is that, if ABCD and AB are frequent itemsets, then it can be calculated whether the rule AB → CD holds with sufficient confidence by computing the ratio confidence = s(ABCD) / s(AB). If the confidence of the rule equals or exceeds the minconf threshold set by the user, then it is a valid rule. For an itemset of size k, there are potentially $2^k - 2$ confident rules.
Association rules can help to discover frequently purchased combinations of products within a customer segment and to provide customised service by promoting certain products or product combinations to the defined segments (Brijs et al., 2001). Therefore, in this research, frequent itemsets for each customer cluster are generated and their combinations are compared in order to identify the differences in purchase behaviour and to provide customised service.
Traditionally, support and confidence are used in association rule discovery, but Aggarwal & Yu (1998) criticised this support-confidence framework for association rule discovery for the following main reasons.
Furthermore, Aggarwal & Yu (1998) and Brin et al. (1998), as cited in Brijs (2003), introduced the lift (also called interest) measure to overcome the disadvantage that confidence does not take the baseline frequency of the consequent into account.
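The rule-generation phase can be sketched as follows. Every non-empty proper subset of a frequent itemset is tried as an antecedent, which gives the count of two to the power k, minus two, candidate rules for an itemset of size k. The function names `candidate_rules` and `confident_rules`, and the dictionary `support_of` that maps each itemset to its support, are hypothetical helpers introduced for illustration.

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every (antecedent, consequent) split of `itemset`."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):  # non-empty, proper antecedents only
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


def confident_rules(itemset, support_of, min_conf):
    """Keep the rules whose confidence s(itemset) / s(antecedent) >= min_conf.

    `support_of` maps frozensets of items to their support values.
    """
    whole = support_of[frozenset(itemset)]
    return [(a, c) for a, c in candidate_rules(itemset)
            if whole / support_of[frozenset(a)] >= min_conf]
```

For the itemset ABCD this yields 14 candidate rules, one of which is AB → CD; only those that meet the minconf threshold are reported.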
Lift/Interest (l) - Lift is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In other words, lift is the ratio of the probability that A and B occur together to the product of the two individual probabilities of A and B.
Lift (l) = s(A & B) / (s(A) · s(B))
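The two descriptions of lift are equivalent. A short derivation, using the support s and confidence c defined earlier and taking a rule A → B as an illustrative case:

```latex
\[
l(A \rightarrow B) \;=\; \frac{c(A \rightarrow B)}{s(B)}
\;=\; \frac{s(A \,\&\, B)/s(A)}{s(B)}
\;=\; \frac{s(A \,\&\, B)}{s(A)\, s(B)} .
\]
```

A lift of 1 corresponds to A and B occurring together exactly as often as would be expected if they were independent; values above 1 indicate that they occur together more often than that.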
In order to perform predictive analysis, it is useful to discover interesting patterns in the given dataset that serve as the basis for future trends. According to Varde et al. (2004), the best and most popular algorithm for this analysis is the Apriori algorithm.
The Apriori algorithm was proposed by Agrawal et al. (1994) (Varde et al., 2004). The algorithm finds frequent items in a given dataset using the anti-monotone constraint (Petrucelli et al., 1999, as cited in Varde et al., 2004).
It works on the principle that 'all subsets of a frequent itemset must also be frequent'. In other words, if at least one subset of an itemset is not frequent, the itemset itself can never be frequent. This principle simplifies the discovery of frequent itemsets significantly, because some itemsets can be ruled out as infrequent without checking their support against the data. This is the key reason for selecting this algorithm, since the association rules for the items can be discovered more quickly and efficiently.
Although the algorithm is very efficient in association rule mining, it has certain drawbacks, identified by Margahny & Shakour (2006).
There are several tools available for clustering and association rule mining, such as ARMiner, Clementine (SPSS), Enterprise Miner (SAS), Intelligent Miner (IBM) and Decision Series (NeoVista). In this research, WEKA, a collection of machine learning algorithms for data mining tasks, is used to mine association rules, and SPSS Statistics 17.0 is used for clustering. WEKA is open-source software that is available online and is very efficient in mining large datasets, whereas SPSS Statistics 17.0 is a statistical analysis package available in the Brunel University computer labs.
Pattern analysis means understanding the results obtained by the algorithms and drawing conclusions. This is the last phase in the data mining process, in which the uninteresting rules or patterns found in the pattern discovery phase are filtered out (Cooley et al., 2000). The uninteresting patterns are filtered out by applying appropriate methodologies to the results in order to produce interesting statistical patterns.
This chapter discussed the concept of data mining, its evolution and its applications in today's business world. It then provided an overview of the role of data mining in improving customer relationships in retail supermarkets, followed by a discussion of the typical data mining approach. It also discussed the techniques and algorithms applied in this project and the reasons for their choice. The following chapter explains the research approach followed in this dissertation.
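The pruning principle can be seen in a compact, level-wise sketch of frequent itemset generation. It assumes that transactions are Python sets and that the threshold is an absolute count; the function name `apriori` and the toy baskets in the usage note are illustrative, and the sketch is not the WEKA implementation used in this research.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune step: discard a candidate if any (k-1)-subset is infrequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count step: only the surviving candidates are checked against the data.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent
```

For instance, with five baskets in which {sugar, curd} occurs only twice and a minimum count of 3, the candidate {milk, sugar, curd} is discarded in the prune step without being counted against the data, because one of its subsets is already known to be infrequent.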
This chapter discusses the research approach employed in this project. It starts with a discussion of the research and literature review methods, followed by the data collection and the justification of the data mining approach applied to the data.
The research approach depends on the objectives and aim of the study, as it helps the researcher to elicit appropriate responses. Boyatzis (1998) defines research methods as a taxonomic procedure for problem solving in which data is first collected based on the research question, hypotheses are stated, data analysis is carried out using appropriate techniques, results are interpreted and conclusions are derived. According to Hussey et al. (1997), research methods can be divided into two types: qualitative and quantitative. Oates (2006) states that quantitative research is based on data or evidence in the form of numbers, whereas qualitative research covers all non-numeric data.
In this research, a quantitative research methodology is used. A quantitative study makes use of numeric data collected from a group of people interested in the subject area, which is then analysed and interpreted with statistical tools to derive results (Cresswell, 1994). Kaplan & Duchon (1988) suggested that the benefits of quantitative research are more obvious in the social, behavioural and organisational fields; however, collecting the data takes more cost, time and resources. The main purpose of quantitative research is to measure how many people feel, think or act in a particular way (Bryman & Bell, 2007). This research measures how people feel, think or act in the context of retail supermarkets and makes recommendations by analysing the data and presenting it statistically.
The secondary data for the research was obtained from the literature review. The literature review was useful for finding out what work has previously been done and for identifying the areas that need to be addressed. It helps in deciding on a viable research question that has not been fully addressed and in critically evaluating previous work (Oates, 2006).
As research in CRM and data mining is difficult to confine to specific disciplines, the relevant material is spread across various journals. The search was carried out through intensive study of books, journals, publications and other material related to data mining for CRM in retail supermarkets. The Brunel Gateway was used to search the major journals, especially data mining and business journals, for the period from 1980 to date (2010). The literature was iteratively collected, analysed and selected depending on its relevance.
The literature search was based on the descriptors "customer relationship management" and "data mining", which initially produced a number of articles. The search was then narrowed to data mining for CRM in retail supermarkets to find appropriate material. The abstract and introduction of each article were reviewed to select the appropriate articles for different parts of the thesis.
At the core of quantitative data analysis is the dataset. A dataset is simply an array of numbers organised into rows and columns (Connolly, 2007). Datasets contain huge amounts of data, which need to be preprocessed before further use.
For this research, it is important to obtain the sales transaction dataset of a retail supermarket store. The main purpose of obtaining the dataset was to perform data mining tasks on the data and to suggest relevant recommendations for providing customised service to distinct customer segments, with the outcome shown statistically. The dataset obtained from the ABC retail supermarket store contains three months of retail sales transaction data, collected from May 27, 2003 to August 31, 2003, a total of 93 days (approximately 12 weeks), with 15,352 transactions.
The data mining process discussed in the literature review is applied in practice to the dataset in order to achieve the aim of this project. There are several research strategies for justifying the data mining approach applied to a dataset. This research follows its own approach in applying the three phases of the data mining process. Bryman & Cramer (2005) suggested that "A researcher who has conducted an experiment may be interested in the extent to which experimental and control groups differ in some extent". For example, some researchers would be interested in monitoring the entire data without preprocessing, and some may be interested only in the items that have been purchased. Each piece of research has its own approach to achieving its aim. Likewise, in this approach the three main phases of the data mining process are applied as follows.
The first phase is data selection and preprocessing, where a number of data cleaning and transformation issues need to be addressed before it can be used for analysis. The dataset used in this research was obtained with a '.gz' extension, which contains customer IDs and their corresponding transactions. All the items and the customers in the dataset are represented by unique integers. However, some item numbers in the dataset represent a group of items rather than an individual item (e.g. unpreserved items like vegetables, fruits, meat and a few others).
The key tasks in data preprocessing include identifying the number of customers who purchased at least once from this supermarket store, the number of transactions made by them and the total number of items the store carried during the period under study. This information helps us to cluster the customers based on their purchase frequency and to generate frequent itemsets for each cluster, in order to identify the differences in the purchasing behaviour of each cluster.
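The preprocessing counts described above can be sketched in a few lines of Python. This is only an illustration: the records below are hypothetical toy data, and the helper name `summarise` and the use of `None` for anonymous customers are assumptions, not part of the original study, which performed this step with spreadsheet filtering.

```python
from collections import Counter

# Hypothetical toy records: (customer ID, or None if anonymous; list of item IDs).
transactions = [
    (101, [39, 48]),
    (None, [38]),
    (102, [39, 32, 48]),
    (101, [170]),
    (None, [39, 48]),
]


def summarise(records):
    """Return the counts needed before clustering and itemset mining."""
    identified = [(cid, items) for cid, items in records if cid is not None]
    # Purchase frequency = number of transactions per identified customer.
    frequency = Counter(cid for cid, _ in identified)
    # Distinct items are counted over all transactions, anonymous ones included.
    distinct_items = {item for _, items in records for item in items}
    return {
        "transactions": len(records),
        "identified_transactions": len(identified),
        "customers": len(frequency),
        "items": len(distinct_items),
        "frequency": dict(frequency),
    }


summary = summarise(transactions)
print(summary)
```

The `frequency` mapping is the single variable that the clustering step would then operate on.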
The second stage is pattern discovery. In this stage, patterns are discovered from the preprocessed dataset through implementing the k-means and Apriori algorithms discussed in the previous chapter.
The first objective in pattern discovery is clustering. In this research, clustering is used to segment the customers into clusters based on their purchase frequency. The k-means algorithm is implemented on the dataset to group the customers into different clusters, and each cluster signifies a particular customer purchase behaviour.
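As a rough illustration of how k-means groups customers on a single variable, the following is a minimal sketch of the standard assignment and update loop (Lloyd's algorithm) in plain Python. The purchase frequencies and the starting centres are hypothetical, and the function `kmeans_1d` is an assumption made for illustration; it is not the SPSS procedure used in this research and its results will depend on the starting centres chosen.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Cluster one numeric variable; returns (final centers, label per value)."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


# Hypothetical purchase frequencies over 12 weeks and hypothetical starting centers.
frequencies = [1, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 17, 20, 23]
centers, labels = kmeans_1d(frequencies, centers=[2, 9, 13, 20])
print(centers)
print(labels)
```

Each label is the index of the cluster a customer falls into, so customers with similar purchase frequencies end up sharing a label.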
The second objective is to generate frequent itemsets for each cluster, that is, itemsets whose support in the dataset exceeds the predefined minimum support threshold. Zheng, Kohavi, and Mason (2001) used minimum support settings of 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, and 0.01% in the context of association rule mining. De Schamphelaere et al. (2003) used an even lower support percentage of 0.009% for DIY (Do It Yourself) stores. As mentioned earlier, the Apriori algorithm is used to generate the frequent itemsets with an absolute support count of 48, which equals a support percentage of approximately 0.3%. Therefore, an item or a set of items is considered frequent if it occurs in at least 48 transactions, which corresponds approximately to the item or set of items being purchased at least four times per week. The support threshold was selected through trial and error (with support counts ranging from 30 to 50), and a minimum of four purchases per week is taken to be acceptable.
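To make the idea of frequent itemset generation concrete, the following is a minimal sketch of the Apriori level-wise search with an absolute support count. The baskets are hypothetical toy data (the item numbers are borrowed from the dataset purely as labels), and the function `apriori` is an illustrative assumption rather than the WEKA implementation used in this research.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(frequent), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Count step: scan the transactions for each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


# Hypothetical baskets; an itemset is frequent here if it occurs in at least 3 of them.
baskets = [[39, 48], [39, 48, 38], [39, 32], [48, 38], [39, 48, 32]]
itemsets = apriori(baskets, min_count=3)
for itemset, count in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In the study itself the threshold would be the absolute support count discussed above rather than the toy value of 3.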
More specifically, this research focuses on identifying the differences in purchase behaviour by generating frequent itemsets for each customer cluster and comparing their combinations. In other words, this study intends to find out whether some items tend to be purchased more frequently, or whether the occurrence of these items differs markedly between the defined segments. The retailer may use this information to customise the offer for each segment and also to further examine the underlying relationships between the items for purposes of pricing, product placement or promotions. Furthermore, the support metric of each frequent itemset is used as a means of comparison.
The final stage is pattern analysis, which involves understanding the results obtained by the algorithms and drawing conclusions. The results obtained from the algorithms will be explained in the analysis chapter. The results are presented statistically through graphs, tables and charts prepared in Microsoft Excel in order to arrive at a conclusion.
A quantitative research method is used for this study, since the study involves data collection and analysis of the data. This chapter has presented the approach to data collection and how the various algorithms are implemented on the dataset. The results of the algorithms and the conclusions derived from them are discussed in the next chapter.
This chapter presents the results drawn from the summarised analysis of the collected data. The analysis is based on the data mining approach discussed earlier, which includes data preprocessing and the implementation of the k-means and Apriori algorithms on the acquired data. Finally, this chapter concludes with recommendations to the retailer for providing customised service to each customer segment.
Before the data can be used for analysis, there are certain data preprocessing issues that need to be addressed. As mentioned earlier, the key tasks in this stage are to identify the number of customers, their purchase frequency and the transactions made by them during the period of data collection. To accomplish this, the data was initially viewed in Notepad and then copied into MS Excel in comma-delimited form (.csv). By using filtering techniques in MS Excel, it was identified that the data consists of 15,352 transactions, of which 6,327 transactions were identifiable with a corresponding customer ID and the rest were made by anonymous customers. Further, there were 639 customers who purchased at least once from this supermarket and a total of 9,972 items carried by the store during the period of data collection.
Now it is required to analyse the purchasing frequency pattern of these 639 customers during the period of 12 weeks. This helps us to study the purchase behaviour of all the customers.
Purchase frequency of each customer in the period of 12 weeks
The above graph illustrates the purchase frequency of each customer for the whole 12 weeks. The x-axis represents the customers and the y-axis represents the frequency. There are very few customers with a purchase frequency above 15 or below 5; most of them fall between 5 and 15. The graph clearly depicts the fluctuation in the frequency pattern of the customers. This corroborates that not all the customers possess similar purchase behaviour or value. The retailer therefore needs to identify the differences in purchase behaviour in order to provide customised service and improve customer satisfaction.
The above analysis shows that customers differ from one another in their purchase frequency. The customers now need to be grouped into clusters based on purchase frequency. Then, in order to identify the differences in their purchase behaviour, frequent itemsets should be generated for each cluster and their combinations compared, so that customised service can be offered to the defined segments.
Brijs (2001) pointed out that if customers can be classified by purchase frequency, these segments could then be treated differently in terms of marketing communication (pricing, promotion, product recommendation etc.) to achieve greater return on investment (ROI) and customer satisfaction. Therefore, in this research clustering is used to segment the customers into clusters based on their purchase frequency.
Initially, the data is loaded into SPSS, which provides various algorithms for clustering. As mentioned earlier, the simple k-means algorithm is used to cluster the customers.
Secondly, the researcher has to decide the number of clusters into which the customers are to be segregated. At first, the plan was to divide the customers into three clusters of similar purchase frequencies. Since there are too many customers for three clusters, it was decided to use four. The number of clusters that the simple k-means algorithm should form can be chosen in SPSS. The clustering results show the cluster each customer belongs to, the number of customers in each cluster and the percentage of each cluster.
The above figure illustrates the percentage of each cluster. The percentage is calculated by dividing the number of customers in each cluster by the maximum instance and multiplying by 100. Here the maximum instance means the total number of customers identified, i.e. 639. As noted, cluster 3 has the highest percentage at 48%, followed by cluster 2 with 25%, cluster 4 with 23% and cluster 1 with 4%. Each cluster contains a group of customers who show similar purchase frequencies.
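The percentage calculation can be reproduced with a short sketch. The cluster sizes are the customer counts reported for each cluster in this chapter; the variable names are illustrative, and the unrounded shares may differ slightly from the rounded percentages quoted in the text.

```python
# Number of customers in each cluster, as reported in this chapter.
cluster_sizes = {1: 27, 2: 162, 3: 307, 4: 143}

# The "maximum instance" is the total number of identified customers.
total_customers = sum(cluster_sizes.values())

# Share of each cluster = customers in cluster / total customers * 100.
percentages = {c: 100 * n / total_customers for c, n in cluster_sizes.items()}

for cluster, pct in percentages.items():
    print(f"cluster {cluster}: {pct:.1f}%")
```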
Generally, the dissimilarity between two clusters is calculated using a distance measure. A simple distance measure such as the Euclidean distance can often be used to reflect dissimilarity between two clusters. Greater distances between clusters correspond to greater dissimilarities (Jain, Murty and Flynn, 1999).
The table below shows the Euclidean distances between the final cluster centers, where the Euclidean distance is the distance between two points computed by joining them with a straight line (Witten & Frank, 2005). The final cluster centers are computed as the mean of each variable within each final cluster and reflect the characteristics of the typical case for each cluster.
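As a sketch of how such a distance table could be computed, the example below measures the Euclidean distance between every pair of cluster centers. It assumes, for illustration only, that each center is the average purchase frequency reported for that cluster later in this chapter; the actual SPSS output table is not reproduced here, and the function name `euclidean` is hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Assumed centers: the average purchase frequency of each cluster (one variable).
centers = {1: (17.67,), 2: (13.07,), 3: (9.42,), 4: (5.88,)}

distances = {
    (i, j): euclidean(centers[i], centers[j])
    for i, j in combinations(sorted(centers), 2)
}

for (i, j), d in distances.items():
    print(f"clusters {i} and {j}: {d:.2f}")
```

With a single clustering variable the distance reduces to the absolute difference between the two centers; the same function works unchanged if more variables are added.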
Some of the most striking differences we can notice are as follows
As the difference between each cluster is studied, the subsequent step is to identify the customers that belong to each cluster and their corresponding purchase frequency.
The customer clustering results for clusters 1-4 are depicted in graphs 4.2, 4.3, 4.4 and 4.5 respectively. The customers are plotted on the x-axis and their purchase frequency on the y-axis. The purchase frequency represents the number of purchases the customer made during the period of 12 weeks.
Figure 4.2 shows the customers who belong to cluster 1 and their respective purchase frequencies. The simple k-means algorithm identified 27 customers who show similar purchasing frequencies. Their purchasing frequencies are very high, ranging from 16 to 23 with an average of 17.67 purchases over the whole 12 weeks. The customers in this cluster made a total of 477 transactions. Even though this cluster is relatively small (4%), its customers are the most active, characterised by their high average purchase frequency.
Figure 4.3 illustrates the results for cluster 2. This cluster is the second largest (25%), with a total of 162 customers who made 2,118 transactions. The purchase frequencies range from 12 to 15 with an average of 13.07 purchases over the entire 12-week period. This clearly suggests that the customers in this cluster are making at least one purchase per week; this cluster is therefore considered reliable in terms of both its size and the purchase frequency of its customers.
Figure 4.4 represents the results for cluster 3. This is the largest cluster of all (48%) and is characterised by less than one purchase per week. It consists of 307 customers who made 2,891 transactions in total, with purchase frequencies ranging from 8 to 11. The customers made 9.42 purchases on average during the period of 12 weeks. Although this cluster is the largest, it is characterised by a relatively low average purchase frequency compared to the average purchase frequency of a customer in the entire selected data (9.96 times). Therefore, it would be important to think of ways to increase the purchase rate of customers in this cluster.
Figure 4.5 depicts the results for cluster 4. This is the third largest cluster (23%), consisting of 143 customers who made a total of 841 transactions with purchase frequencies ranging from 1 to 7. The customers made 5.88 purchases on average during the period of 12 weeks. Comparatively, this cluster shows a very low purchase frequency, with less than one purchase every two weeks. It is likely that the customers in this cluster are new or that they prefer to purchase non-perishable items, which are mostly bought occasionally. Given its size, this cluster should be treated separately by means of customised service.
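The average purchase frequencies quoted for the four clusters follow directly from the reported customer and transaction counts, as the short sketch below shows. The variable names are illustrative, and the per-week figure simply divides the 12-week average by 12.

```python
WEEKS = 12

# (customers, transactions) for each cluster, as reported above.
clusters = {1: (27, 477), 2: (162, 2118), 3: (307, 2891), 4: (143, 841)}

# Average purchases per customer over the whole period, and per week.
average = {c: tx / customers for c, (customers, tx) in clusters.items()}
per_week = {c: avg / WEEKS for c, avg in average.items()}

for c in clusters:
    print(f"cluster {c}: {average[c]:.2f} purchases in {WEEKS} weeks, "
          f"{per_week[c]:.2f} per week")
```

The per-week values make the verbal descriptions easy to check: at least one purchase per week for cluster 2, less than one per week for cluster 3, and less than one every two weeks for cluster 4.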
The first objective, segmenting the customers based on their purchase frequencies, was accomplished by implementing the k-means algorithm on the dataset and grouping the customers into different clusters. The next objective is to identify the differences in purchase behaviour by generating frequent itemsets for each cluster and comparing their combinations.
As discussed in the previous chapter, frequent itemsets were generated for each cluster by applying the Apriori algorithm to the prepared data with a support percentage of 0.3%, which equals an absolute support count of 48. Therefore, an item or a set of items is considered frequent if it occurs in at least 48 transactions, which corresponds approximately to the item or set of items being purchased at least four times per week. This support percentage and absolute support count are derived from the whole 15,352 transactions made in 12 weeks. The absolute support count will, however, vary for each cluster depending on the number of transactions made by the customers in that cluster. Furthermore, we use the support of each frequent itemset as a means of comparison.
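The relationship between the support percentage and the absolute support count can be illustrated as follows. The transaction totals are those reported in this chapter; rounding the per-cluster count up to the next whole transaction is an assumption made for this sketch, since the text does not state how the tool converts a percentage into a count.

```python
import math

TOTAL_TRANSACTIONS = 15352
WEEKS = 12
SUPPORT_COUNT = 48

# Absolute count of 48 expressed as a percentage of all transactions.
support_pct = 100 * SUPPORT_COUNT / TOTAL_TRANSACTIONS
# The same count expressed as purchases per week.
purchases_per_week = SUPPORT_COUNT / WEEKS

# Transactions per cluster, as reported in this chapter.
cluster_transactions = {1: 477, 2: 2118, 3: 2891, 4: 841}

# Assumed conversion: 0.3% of the cluster's transactions, rounded up.
# Integer arithmetic (n * 3 / 1000) avoids floating-point surprises.
cluster_min_count = {c: math.ceil(n * 3 / 1000) for c, n in cluster_transactions.items()}

print(f"{support_pct:.2f}% support, {purchases_per_week:.0f} purchases per week")
print(cluster_min_count)
```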
In total, 1,073 frequent itemsets of size 1, 2, 3 and 4 were generated from all the clusters, with support percentages ranging from 0.3 to 63.50. The majority of the frequent itemsets are of size 1, 2 or 3, and larger sets are rather exceptional. At this point, it can be argued that the variance in the number of itemsets between the clusters is mainly influenced by the unique items that make up the customers' transactions in each cluster. This can be observed by studying the ratio between the number of items and the number of transactions in each cluster.
Looking at the ratio (1/2), we can clearly see that the higher the ratio between the number of items and the number of transactions in a cluster, the more itemsets are generated. Statistically speaking, the probability of an item or a set of items occurring together in a transaction is higher for the clusters with a higher ratio of items to transactions.
The next step focuses on identifying the differences in purchasing behaviour between the clusters by comparing the combinations of frequent itemsets for each cluster. As mentioned earlier, we use the support of each frequent itemset as a means of comparison. Given the space limitation and the high number of frequent itemsets generated for each cluster, a full comparative analysis of the purchasing behaviour among the clusters becomes cumbersome. Therefore, only the itemsets that are common to all the clusters were included in the analysis. This is indeed a limitation of this research. However, with regard to purchasing behaviour it is believed that these common itemsets are the most important ones in explaining the differences between the clusters, because the support of the uncommon itemsets is typically near or equal to the minimum support of 0.3 and will thus not influence the purchasing behaviour significantly. More detailed figures for the itemsets of each cluster can be found in appendix 1.
The highlighted cells in the above table indicate the highest support value of the itemset among all the clusters. The following section will summarize the most prevalent characteristics of each cluster based on the itemsets' supports.
Cluster 1 is the smallest of all clusters (4%) with 27 customers and has the highest purchasing frequencies, ranging from 16 to 23 times during the period of 12 weeks. This cluster has relatively high supports for items 38, 170, 101, 36, 110, 185, 37, 260 and 245. When looking at their combinations with other items, some interesting information is revealed. Some of the striking differences are as follows
This cluster is considered to be the second largest (25%) and shows high supports for items 32, 60, 475, 9, 271, 533, 18 and their combinations. Interestingly in this cluster, items 170, 89 and 110 showed very low support with other items compared to their own independent support. Some of them are as follows
Cluster 3 is the largest of all (48%) and is characterised by higher supports for items 89, 237, 123, 65, 270 and 11. Some of the interesting differences of this cluster are as follows.
This is the third largest cluster of all (23%) and is characterised by relatively low support for most of the items. Only items 39, 48 and 45 exhibited high supports in this cluster. Some of the key variations are as follows.
Based on the above results and analysis, actions or recommendations can be undertaken to motivate or modify the purchase behaviour of each cluster. The key recommendations derived are as follows.
First of all, the support of items 39, 48, 38 and 32 and their combinations is much stronger in all four clusters. Hence, one can reasonably assume that these items belong to the fresh product categories, since most fresh product categories include perishable items that must be purchased regularly, which may explain their relatively high support. These fresh product categories are therefore strategically significant to the retailer in keeping customers satisfied.
Secondly, purchase rates for items within each cluster can be stimulated through customised offers and service specific to that cluster. For instance, in cluster 1 items 9 and 23 showed lower independent support (1.68 & 0.63) than in the other clusters, but hold a reliable support in combination with item 39 (1.26 & 0.63). It is therefore sensible to offer the customers in this cluster customised reduction vouchers for items 9 and 23 with the purchase of item 39. Similarly, in clusters 2 and 3, item 170 showed comparatively low support (2.60 & 2.14), so it could be worthwhile to print a special offer related to the purchase of item 170 on the cash receipts of the customers in these clusters.
The customers in cluster 4 showed a comparatively very low purchase frequency, with less than one purchase every two weeks. Given the size of this cluster (23%), it would be beneficial to think of ways to increase the purchase frequency of its customers. It could be productive to offer customised promotion leaflets and discount vouchers for the purchase of certain items within a specific date in order to attract them to visit the store.
These recommendations are proposed with strong evidence obtained by the application of advanced data mining tools and thus help in providing customised service to defined customer segments in a retail supermarket.
This chapter discussed the data selected for the research and the clustering and association results, with the help of graphs, charts and tables. Finally, based on the analysis of the results, recommendations were suggested to help in providing customised service to defined customer segments in the context of a retail supermarket. The next chapter summarises the entire project by highlighting the main points, research limitations and possibilities for further research.
This chapter presents a summary of the entire project along with key findings, research limitations and the areas of future research.
The aim of this research was to implement the data mining techniques on sales transaction data with the help of data mining tools, in order to suggest the recommendations for customised service to the defined customer segments in a retail supermarket. This has been achieved through critical review of literature within the discipline, obtaining and preprocessing the sales transaction data, implementing the data mining algorithms discussed in the literature on the pre-processed data, and providing suggestions using the analysis of results from the algorithms.
The first chapter briefly discussed the data mining implications in retail supermarkets and highlighted the research aim, objectives, research approach to achieve them and dissertation outline.
To achieve the objectives, the second chapter identified the critical aspects in the literature that are used to mine the sales transaction data of a retail supermarket to improve customer relationships. Mainly, this chapter involved a detailed description of the data mining process, which explains the various steps involved in preprocessing the data and how the simple k-means and Apriori algorithms best suit the project for pattern discovery. Finally, it provides a brief discussion of SPSS and WEKA, the tools used for the implementation of the algorithms on the data.
The third objective of this research was addressed in chapter three, which discusses the research approach that has been adopted for this study. This chapter describes how the data mining process discussed in the literature review chapter is applied in practice to the dataset. This includes how the data has been pre-processed and how the algorithms are implemented on it with the help of SPSS and WEKA, the tools selected for this research.
The fourth chapter provides the analysis of the results obtained from the data mining algorithms with the help of graphs, charts and tables. The analysis is based on clustering the customers into various clusters with similar purchase frequencies and identifying the differences in their purchasing behaviour through generating frequent itemsets for each cluster and comparing their combinations. Based on this analysis, recommendations were made to provide customised service to the defined customer segments of the retail supermarket under study. Thus this chapter successfully achieved the fourth objective of this research.
The last objective is addressed by the final chapter 5, which summarises the whole dissertation and throws light on the research contributions, limitations of this research and concludes with a discussion on future research and development.
Previously, sales transaction data was available more consistently than customer data in retail supermarkets. Nowadays, mainly due to loyalty card programmes (e.g. Tesco Clubcard), customers are more identifiable than before. This availability of customer data along with transaction data supports data mining in retail supermarkets to create effective customer-oriented solutions. In reality, no single data mining algorithm is capable of addressing a business problem; rather, a customised solution combining various techniques is required. Accordingly, this research targets the problem of customised service for specific customer segments in a retail supermarket.
From the academic point of view, most of the research in the field of clustering and association rules has focussed on improving computational efficiency and less on practical application. Regardless of the progress achieved in computational efficiency, the majority of the data mining tools available still rely on basic algorithms such as k-means for clustering and Apriori for association rule discovery. Therefore, this project makes an academic contribution by presenting the practical utility of both the k-means and Apriori algorithms in the context of a retail supermarket, to solve the above stated problem.
From the practical perspective this research contributes even more. Recent trends such as decreased profit margins and increased discount rates in the retail supermarket domain create new challenges for the retailer to stay competitive. These growing trends further have a significant impact on customer relationships, since customers are offered a great range of choices by various retailers. Hence, an increased focus on the customer is required to cope with these challenges, and data mining techniques are useful for understanding the customer better. In this dissertation, the customers' sales transaction data was analysed to reveal interesting knowledge about their purchasing behaviour in order to provide customised offers to specific customer segments. This helps the retailer to gain customer loyalty and in turn achieve greater return on investment.
One of the main limitations of this research is that, due to the lack of data, it does not take any monetary variable into account for customer segmentation. Instead, the segmentation is based on the customers' frequency of purchases. However, the purchase frequency variable is considered to be the main criterion that indicates customer loyalty, and using a monetary criterion for segmentation may result in missing customers who possess high potential value but do not yet belong to the highly profitable segment.
Another limitation is that this research exploited only one algorithm for clustering and one for association rule discovery. Therefore, only the patterns obtained from these algorithms are discussed. The next limitation is the default heap size setting of 63.6 megabytes in WEKA, as generating frequent itemsets for the chosen support threshold of 0.3 is infeasible with this default setting. However, the heap size can be increased temporarily from the DOS command line interpreter (e.g. C:\Program Files (x86)\Java\jre6\bin>java.exe -Xmx256m -jar "C:\Users\Satyaprakash BALLA\Desktop\weka.jar"). Furthermore, only the support metric is used for the comparison of purchasing behaviour between the customer segments.
To overcome the above limitations, future work could use more than one metric (support, confidence, lift) for the comparison of purchasing behaviour. Likewise, more than one algorithm could be implemented for both clustering and association rule discovery, and the results compared to identify their merits and demerits. Further, the procedure employed in this study can be applied to datasets from other domains such as telecommunication, banking and insurance.
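For reference, the three metrics mentioned above can be computed for a single rule as in the following sketch. The baskets are hypothetical toy data and the function `rule_metrics` is an illustrative assumption; it uses the standard definitions of support, confidence and lift and assumes that both sides of the rule occur in at least one transaction.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    count_a = sum(1 for t in sets if a <= t)         # transactions containing the antecedent
    count_c = sum(1 for t in sets if c <= t)         # transactions containing the consequent
    count_ac = sum(1 for t in sets if (a | c) <= t)  # transactions containing both
    support = count_ac / n
    confidence = count_ac / count_a    # assumes the antecedent occurs at least once
    lift = confidence / (count_c / n)  # assumes the consequent occurs at least once
    return support, confidence, lift


# Hypothetical baskets; item numbers are used purely as labels.
baskets = [[39, 48], [39, 48, 38], [39, 32], [48, 38], [39, 48, 32]]
support, confidence, lift = rule_metrics(baskets, antecedent=[39], consequent=[48])
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.4f}")
```

A lift below 1 indicates that the two items occur together less often than would be expected if they were purchased independently, which support alone does not reveal.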
Finally, current research in data mining has made considerable progress in increasing the computational efficiency of the algorithms but has focussed less on their practical application. Therefore, this research calls for future studies to look more closely into the practical utility of the algorithms.
The Evolution of Data Mining. (2017, Jun 26).
Retrieved November 21, 2024 , from
https://studydriver.com/the-evolution-of-data-mining/