7 Clustering

This chapter describes clustering, the unsupervised mining function for discovering natural groupings in the data.

About Clustering

Clustering analysis finds clusters of data objects that are similar in some sense to one another. The members of a cluster are more like each other than they are like members of other clusters. The goal of clustering analysis is to find high-quality clusters such that the inter-cluster similarity is low and the intra-cluster similarity is high.
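
As an illustration of this quality goal, the following Python sketch compares the average distance between cases in the same cluster with the average distance between cases in different clusters. It is not part of Oracle Data Mining; the function names and the toy data are hypothetical, and Euclidean distance is assumed as the (inverse) measure of similarity. A small intra-cluster distance corresponds to high intra-cluster similarity.

```python
from itertools import combinations
from math import dist


def mean_intra_cluster_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    pairs = [
        dist(a, b)
        for points in clusters.values()
        for a, b in combinations(points, 2)
    ]
    return sum(pairs) / len(pairs)


def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    pairs = [
        dist(a, b)
        for points_a, points_b in combinations(clusters.values(), 2)
        for a in points_a
        for b in points_b
    ]
    return sum(pairs) / len(pairs)


# Hypothetical toy data: two tight groups that lie far apart.
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}
intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

For a good segmentation, the first value should be much smaller than the second.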

Clustering, like classification, is used to segment the data. Unlike classification, clustering models segment data into groups that were not previously defined. Classification models segment data by assigning it to previously defined classes, which are specified in a target. Clustering models do not use a target.

Clustering is useful for exploring data. If there are many cases and no obvious groupings, clustering algorithms can be used to find natural groupings. Clustering can also serve as a useful data-preprocessing step to identify homogeneous groups on which to build supervised models.

Clustering can also be used for anomaly detection. Once the data has been segmented into clusters, you might find that some cases do not fit well into any clusters. These cases are anomalies or outliers.
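
One simple way to express "does not fit well into any cluster" is to measure how far a case lies from the nearest cluster centroid. The Python sketch below assumes numerical attributes, Euclidean distance, and a user-chosen threshold; these are illustrative assumptions, not the method used by Oracle Data Mining, and all names are hypothetical.

```python
from math import dist


def nearest_centroid_distance(case, centroids):
    """Distance from a case to the closest cluster centroid."""
    return min(dist(case, centroid) for centroid in centroids)


def find_outliers(cases, centroids, threshold):
    """Return the cases whose nearest centroid is farther away than threshold."""
    return [
        case
        for case in cases
        if nearest_centroid_distance(case, centroids) > threshold
    ]


# Hypothetical centroids and cases.
centroids = [(0.0, 0.0), (10.0, 10.0)]
cases = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0), (0.1, -0.3)]
outliers = find_outliers(cases, centroids, threshold=2.0)
print(outliers)
```

The case that lies midway between the two centroids is the only one reported, because it is far from both.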

Interpreting Clusters

Since known classes are not used in clustering, the interpretation of clusters can present difficulties. How do you know if the clusters can reliably be used for business decision making?

You can analyze clusters by examining information generated by the clustering algorithm. Oracle Data Mining generates the following information about each cluster:

Position in the cluster hierarchy, described in "Cluster Rules"
Rule for the position in the hierarchy, described in "Cluster Rules"
Attribute histograms, described in "Attribute Histograms"
Cluster centroid, described in "Centroid of a Cluster"

As with other forms of data mining, the process of clustering may be iterative and may require the creation of several models. The removal of irrelevant attributes or the introduction of new attributes may improve the quality of the segments produced by a clustering model.

How are Clusters Computed?

There are several different approaches to the computation of clusters. Clustering algorithms may be characterized as:

Hierarchical — Groups data objects into a hierarchy of clusters. The hierarchy can be formed top-down or bottom-up. Hierarchical methods rely on a distance function to measure the similarity between clusters.

Note:
The clustering algorithms supported by Oracle Data Mining perform hierarchical clustering.
Partitioning — Partitions data objects into a given number of clusters. The clusters are formed in order to optimize an objective criterion such as distance.
Locality-based — Groups neighboring data objects into clusters based on local conditions.
Grid-based — Divides the input space into hyper-rectangular cells, discards the low-density cells, and then combines adjacent high-density cells to form clusters.
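
To make the grid-based description concrete, the following Python sketch applies the three steps named above to two-dimensional points: divide the space into cells, discard low-density cells, and combine adjacent high-density cells. It is a generic illustration only and is not the O-Cluster algorithm; the function name and the parameters cell_size and min_points are hypothetical.

```python
from collections import Counter


def grid_clusters(points, cell_size, min_points):
    """Label 2-D points by merging adjacent dense grid cells."""

    def cell_of(point):
        # Step 1: map each point to the rectangular cell that contains it.
        return (int(point[0] // cell_size), int(point[1] // cell_size))

    counts = Counter(cell_of(p) for p in points)

    # Step 2: discard the low-density cells.
    dense = {cell for cell, n in counts.items() if n >= min_points}

    # Step 3: combine dense cells that share an edge (flood fill).
    label_of = {}
    next_label = 0
    for start in sorted(dense):
        if start in label_of:
            continue
        label_of[start] = next_label
        stack = [start]
        while stack:
            cx, cy = stack.pop()
            neighbors = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
            for neighbor in neighbors:
                if neighbor in dense and neighbor not in label_of:
                    label_of[neighbor] = next_label
                    stack.append(neighbor)
        next_label += 1

    # Points that fall in discarded cells receive no cluster (None).
    return [label_of.get(cell_of(p)) for p in points]


# Hypothetical data: two dense regions and one isolated point.
points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
          (5.1, 5.2), (5.3, 5.4), (9.0, 0.5)]
labels = grid_clusters(points, cell_size=1.0, min_points=2)
print(labels)
```

The first four points fall in two neighboring dense cells and therefore share a cluster, while the isolated point is left unassigned.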

Reference:

Campos, M.M., Milenova, B.L., "O-Cluster: Scalable Clustering of Large High Dimensional Data Sets", Oracle Data Mining Technologies, 10 Van De Graaff Drive, Burlington, MA 01803.

http://www.oracle.com/technology/products/bi/odm/

Cluster Rules

Oracle Data Mining performs hierarchical clustering. The leaf clusters are the final clusters generated by the algorithm. Clusters higher up in the hierarchy are intermediate clusters.

Rules describe the data in each cluster. A rule is a conditional statement that captures the logic used to split a parent cluster into child clusters. A rule describes the conditions for a case to be assigned with some probability to a cluster. For example, the following rule applies to cases that are assigned to cluster 19:

IF
 OCCUPATION in Cleric. AND OCCUPATION in Crafts 
     AND OCCUPATION in Exec. 
     AND OCCUPATION in Prof. 
 CUST_GENDER in M
 COUNTRY_NAME in United States of America
 CUST_MARITAL_STATUS in Married
 AFFINITY_CARD in 1.0         
 EDUCATION in < Bach. 
     AND EDUCATION in Bach. 
     AND EDUCATION in HS-grad 
     AND EDUCATION in Masters 
 CUST_INCOME_LEVEL in B: 30,000 - 49,999 
     AND CUST_INCOME_LEVEL in E: 90,000 - 109,999 
 AGE lessOrEqual 0.7 
     AND AGE greaterOrEqual 0.2
THEN
Cluster equal 19.0
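
The sketch below shows one way to test whether a case meets conditions like those in the rule above. It assumes that the conditions listed for one categorical attribute form a set of permitted values, and that the numerical conditions form an inclusive range; this reading, the dictionary layout, and the function name are assumptions made for illustration. Only some of the attributes in the rule are encoded.

```python
# Hypothetical encoding of part of the rule for cluster 19.
CATEGORICAL_CONDITIONS = {
    "OCCUPATION": {"Cleric.", "Crafts", "Exec.", "Prof."},
    "CUST_GENDER": {"M"},
    "CUST_MARITAL_STATUS": {"Married"},
}
NUMERICAL_CONDITIONS = {
    "AGE": (0.2, 0.7),  # inclusive lower and upper bounds
}


def satisfies_rule(case):
    """Return True if the case meets every encoded condition."""
    for attribute, allowed in CATEGORICAL_CONDITIONS.items():
        if case.get(attribute) not in allowed:
            return False
    for attribute, (low, high) in NUMERICAL_CONDITIONS.items():
        value = case.get(attribute)
        if value is None or not (low <= value <= high):
            return False
    return True


# Hypothetical cases.
inside = {"OCCUPATION": "Exec.", "CUST_GENDER": "M",
          "CUST_MARITAL_STATUS": "Married", "AGE": 0.45}
outside = dict(inside, AGE=0.9)
print(satisfies_rule(inside), satisfies_rule(outside))
```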

Support and Confidence

Support and confidence are metrics that describe the relationships between clustering rules and cases.

Support is the percentage of cases for which the rule holds.

Confidence is the probability that a case described by this rule will actually be assigned to the cluster.
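
The two definitions can be sketched in Python as follows. The sketch assumes that support is measured over all cases in the data and that confidence is the fraction of rule-matching cases assigned to the cluster; the function name, the rule predicate, and the data are hypothetical.

```python
def support_and_confidence(cases, rule_holds, cluster_id):
    """Compute support (percent) and confidence (probability) for a rule.

    cases is a list of (attributes, assigned_cluster) pairs, and
    rule_holds is a function that tests the rule against the attributes.
    """
    matching = [assigned for attributes, assigned in cases if rule_holds(attributes)]
    support = 100.0 * len(matching) / len(cases)
    if matching:
        hits = sum(1 for assigned in matching if assigned == cluster_id)
        confidence = hits / len(matching)
    else:
        confidence = 0.0
    return support, confidence


# Hypothetical cases: a normalized AGE value and the assigned cluster.
cases = [
    ({"AGE": 0.3}, 19),
    ({"AGE": 0.5}, 19),
    ({"AGE": 0.6}, 4),
    ({"AGE": 0.9}, 19),
    ({"AGE": 0.1}, 4),
]


def age_rule(attributes):
    return 0.2 <= attributes["AGE"] <= 0.7


support, confidence = support_and_confidence(cases, age_rule, cluster_id=19)
print(support, confidence)
```

In this example the rule holds for three of the five cases, and two of those three are assigned to cluster 19.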

Number of Clusters

The CLUS_NUM_CLUSTERS build setting specifies the maximum number of clusters that can be generated by a clustering algorithm.

Attribute Histograms

In Oracle Data Miner, a histogram represents the distribution of the values of an attribute in a cluster. Figure 7-1 shows a histogram for the distribution of occupations in a cluster of customer data.

In this cluster, about 13% of the customers are craftsmen, about 13% are executives, 2% are farmers, and so on. None of the customers in this cluster are in the armed forces or work in housing sales.

Figure 7-1 Histogram in Oracle Data Miner
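
The percentages in such a histogram can be derived from the attribute values of the cases in a cluster. The following Python sketch shows the calculation; the function name and the occupation values are hypothetical and do not reproduce the data in the figure.

```python
from collections import Counter


def attribute_histogram(values):
    """Percentage of cases in a cluster that take each attribute value."""
    counts = Counter(values)
    total = len(values)
    return {value: 100.0 * n / total for value, n in counts.items()}


# Hypothetical occupations of the eight cases in one cluster.
occupations = ["Crafts", "Exec.", "Crafts", "Sales",
               "Exec.", "Prof.", "Sales", "Sales"]
histogram = attribute_histogram(occupations)
for value, percent in sorted(histogram.items()):
    print(f"{value:8s} {percent:5.1f}%")
```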

Centroid of a Cluster

The centroid represents the most typical case in a cluster. For example, in a data set of customer ages and incomes, the centroid of each cluster would be a customer of average age and average income in that cluster. If the data set included gender, the centroid would have the gender most frequently represented in the cluster. Figure 7-1 shows the centroid values for a cluster.

The centroid is a prototype. It does not necessarily describe any given case assigned to the cluster. The attribute values for the centroid are the mean of the numerical attributes and the mode of the categorical attributes.
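
The following Python sketch computes a centroid in this way, taking the mean of each numerical attribute and the mode of each categorical attribute. The function name, attribute names, and data are hypothetical.

```python
from statistics import mean, mode


def centroid(cases, numerical, categorical):
    """Mean of each numerical attribute and mode of each categorical one."""
    prototype = {}
    for attribute in numerical:
        prototype[attribute] = mean([case[attribute] for case in cases])
    for attribute in categorical:
        prototype[attribute] = mode([case[attribute] for case in cases])
    return prototype


# Hypothetical cluster of three customers.
cases = [
    {"AGE": 30, "INCOME": 40000, "GENDER": "F"},
    {"AGE": 40, "INCOME": 50000, "GENDER": "M"},
    {"AGE": 50, "INCOME": 90000, "GENDER": "F"},
]
prototype = centroid(cases, numerical=["AGE", "INCOME"], categorical=["GENDER"])
print(prototype)
```

The resulting prototype combines an age of 40, an income of 60000, and gender F. No customer in the sample has exactly these values, which illustrates that the centroid need not describe any given case.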

Scoring New Data

Oracle Data Mining supports the scoring operation for clustering. In addition to generating clusters from the build data, clustering models create a Bayesian probability model that can be used to score new data.
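
The general idea of probabilistic scoring can be sketched with Bayes' rule. The Python example below assumes categorical attributes that are independent within each cluster, with known cluster priors and per-cluster value probabilities. These assumptions, the names, and the numbers are illustrative only; the internal form of the Oracle Data Mining model is not described here.

```python
def cluster_probabilities(case, priors, likelihoods):
    """Posterior P(cluster | case), assuming independent attributes."""
    scores = {}
    for cluster, prior in priors.items():
        score = prior
        for attribute, value in case.items():
            # P(attribute = value | cluster); unseen values get probability 0.
            score *= likelihoods[cluster][attribute].get(value, 0.0)
        scores[cluster] = score
    total = sum(scores.values())
    if total == 0:
        raise ValueError("case has zero probability under every cluster")
    return {cluster: score / total for cluster, score in scores.items()}


# Hypothetical model with two clusters and one attribute.
priors = {1: 0.6, 2: 0.4}
likelihoods = {
    1: {"CUST_GENDER": {"M": 0.8, "F": 0.2}},
    2: {"CUST_GENDER": {"M": 0.3, "F": 0.7}},
}
posterior = cluster_probabilities({"CUST_GENDER": "M"}, priors, likelihoods)
print(posterior)
```

Each new case receives a probability for every cluster, and the probabilities sum to 1.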

Sample Clustering Problems

These examples use the clustering model km_sh_clus_sample, created by one of the Oracle Data Mining sample programs, to show how clustering might be used to find natural groupings in the build data or to score new data.

Figure 7-2 shows six columns and ten rows from the case table used to build the model. Note that no column is designated as a target.

Figure 7-2 Build Data for Clustering

Example: Find Clusters

Suppose you want to segment your customer data before performing further analysis. You could analyze the metrics generated for the data by the clustering algorithm. Figure 7-3 shows clustering details displayed in Oracle Data Miner. The details describe the yrs_residence attribute in cluster 3. They show that 20% of customers have been in their current residence for 2 years, almost 25% have been in their current residence for 3 years, and so on.

Figure 7-3 Cluster Information for the Build Data

Example: Score New Data

Suppose you want to segment a database of regional customer data for marketing research purposes. You might experiment by using a clustering model that you developed for a different region. Figure 7-4 shows some of the cluster assignments in the scored customer data. It shows a 95.5% probability that customer 100,001 is a member of cluster 15, an 89% probability that customer 100,002 is in cluster 6, and so on.

Figure 7-4 Scored Customer Data

Note:

Oracle Data Miner displays the generalized case ID in the DMR$CASE_ID column of the apply output table. The cluster assignment for each case is displayed in the CLUSTER_ID column. The probability of membership in that cluster is displayed in the PROBABILITY column.
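
Once the apply output has been retrieved, the three columns can be processed like any other tabular data. The Python sketch below groups case IDs by cluster and keeps only assignments at or above a chosen probability. The first two rows use the values quoted in the example above; the third row, the function name, and the threshold are hypothetical.

```python
from collections import defaultdict

# Rows of an apply output table, represented as dictionaries.
rows = [
    {"DMR$CASE_ID": 100001, "CLUSTER_ID": 15, "PROBABILITY": 0.955},
    {"DMR$CASE_ID": 100002, "CLUSTER_ID": 6, "PROBABILITY": 0.89},
    {"DMR$CASE_ID": 100003, "CLUSTER_ID": 15, "PROBABILITY": 0.52},  # hypothetical
]


def confident_members(rows, min_probability):
    """Group case IDs by cluster, keeping only sufficiently probable ones."""
    members = defaultdict(list)
    for row in rows:
        if row["PROBABILITY"] >= min_probability:
            members[row["CLUSTER_ID"]].append(row["DMR$CASE_ID"])
    return dict(members)


segments = confident_members(rows, min_probability=0.8)
print(segments)
```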

The conditions of membership in a cluster are described in a rule. Figure 7-5 shows the rule for cluster 15.

Figure 7-5 A Cluster Rule

Clustering Algorithms

Oracle Data Mining supports two clustering algorithms: an enhanced version of k-means, and an Oracle proprietary algorithm called Orthogonal Partitioning Clustering (O-Cluster). Both algorithms perform hierarchical clustering.

The main characteristics of the enhanced k-means and O-Cluster algorithms are compared in Table 7-1.

Table 7-1 Clustering Algorithms Compared

Feature	Enhanced k-Means	O-Cluster
Clustering methodology	Distance-based	Grid-based
Number of cases	Handles data sets of any size	More appropriate for data sets that have more than 500 cases. Handles large tables through active sampling
Number of attributes	More appropriate for data sets with a low number of attributes	More appropriate for data sets with a high number of attributes
Number of clusters	User-specified	Automatically determined
Hierarchical clustering	Yes	Yes
Probabilistic cluster assignment	Yes	Yes
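
For readers unfamiliar with distance-based clustering, the following Python sketch shows the standard k-means procedure with a user-specified number of clusters. It is the textbook algorithm, not the enhanced version implemented by Oracle Data Mining; it does not build a hierarchy or assign probabilities, and the function name, starting centroids, and data are hypothetical.

```python
from math import dist


def k_means(points, initial_centroids, iterations=10):
    """Textbook k-means: assign points to the nearest centroid, then update."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# Hypothetical data: the number of clusters (2) is set by the starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignment = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignment)
```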