Dr R Rajendra Prasath and Sumit Goswami, IIT Kharagpur
Clustering is an unsupervised way of putting the objects (data, text, features etc) into coherent groups. It is useful in many application areas in which the identification of associated features is of primary objective. Different types of clustering algorithms include k-nearest neighbor, hierarchical, partitional, mixture-solving and mode seeking, fuzzy, search based, semantic, spectral, artificial neural network (ANN) based, and evolutionary based approaches. Clustering large data is a computationally expensive task. So, the demand of fast clustering approaches is growing. Here we present a fast and efficient clustering tool that uses graph based representation of the objects as nodes and their associations as edges.
How clustering works
Consider a set of unlabelled (raw) objects as shown in Figure. 1(a). Now the task is to apply the clustering algorithm so as to group the objects as shown in the Figure. 1(b)
A basic clustering algorithm includes the following steps:
1) Feature extraction from the given data.
2) Defining the Proximity Measure among the objects; may be a distance measure between objects.
3) Formation of Coherent Groups.
4) Evaluating the purity of the coherent groups/clusters.
5) Presenting the clusters as output.
What does Graclus do
Graclus is a fast clustering tool that computes the clusters from unlabelled data using graph representations. The input data to be clustered is first encoded into a graph with its information in the adjacency list format in either of two ways: simple graph or weighted graph. Simple graph assumes unit weight to the edges between any two nodes. The weighted graph considers different weights to the edges connecting any two nodes. These weights can be computed in user defined ways. One simple approach to compute the weight of an edge between any two nodes is finding the co-occurrence statistics of those two selected objects.
Then, Graclus tool applies normalized cut and ratio association between nodes and edges of a graph without eigenvector computation and clusters the objects using three steps: coarsening, clustering, and refining. Coarsening transforms the given graph into smaller and smaller sub-graphs of desired size. Now, we use these smaller sub-graphs with breadth-first traversal of vertices to form the clusters. Each smaller sub-graph is clustered and the clusters are incrementally added and refined to get the final clusters. The final output contains cluster IDs to which each object belongs to.
Building the Tool
There are three versions of Graclus Open source tool in Gunzipped tar files – ver 1.0, ver 1.1, and ver 1.2 (latest). This tool was originally implemented in C language. The latest version - graclus1.2.tar.gz - can be downloaded from the link (1) mentioned in the Primary Links. The source C files can be extracted from the tar using:
$ tar –xzvf graclus1.2.tar.gz
This will extract the files in the folder namely graclus1.2. The latest version GCC/G++ compiler and MAKE / GMAKE are the prerequisite for compiling this distribution. The following commands are used to build the distribution and get the executable graclus.
$ cd graclus1.2
$ make clean
On successful compilation, this will create “graclus” - the executable file. There is a Matlab interface implemented to perform this fast graph clustering algorithm and the bug free code can be found at: http://ld2.in/43i
Building the Input Graph
Circled objects are nodes and the lines connecting them are edges. We transform the given data into an undirected graph G = (V, E) where V is the set of objects (data, feature, etc.) and E is the set of edges connecting a pair of objects with certain weight. This weight signifies the strength of association between these two objects with respect to the specific feature selected by the user. Since the graph is undirected, we do not take the order of the objects into account i.e. the relation between the objects (1) and (5) is same as the relation between (5) and (1).
Consider the text data:
Doc 1 : Human machine interface for Lab computer
Doc 2 : opinion of computer system response time
Doc 3 : User interface management system
Doc 4 : Human system engineering testing of EPS
Doc 5 : User perceived response time
From the above documents, we extract the unique term list having 17 entries as below:
Each term is considered as a node and the number of documents having the term co-occurrences defines the edge weight. The Serial Number inside ( ) denotes the label of the node associated with the corresponding term. There are two ways to compute the edge weights of the graph: (1) All edges have the same weight and (2) Edges weights are different.
(1) All edges have the same weight: Consider the graph in Figure 2. Here all edges are assumed to have the same weight.
The corresponding adjacency list with graph details of Figure 2 is as follows:
Here the first line contains 3 entries:number of nodes, number of edges, and type of the edge weighting (0 = un-weighted; 1 = weighted). Then each line from second onwards shows link details of the objects 1 to n. Thus in this plain text file, we will have 1+n lines in total. The entries present inside the text box is saved in the plain text file with a file name (Here we save the data in the file name: input.graph)
(2) Edges weights are different: Consider the graph in Figure 3. Here the edges may have distinct edge weights. The corresponding adjacency list with graph details of Figure 3 is as follows:
17 55 1
16 1 2 1 3 1 4 1 5 1 6 1 8 1 9 1 14 1 15 1
1 1 3 1 4 1 5 1 6 1
1 1 2 1 4 1 5 1 6 1 9 1 12 1 13 1
1 1 2 1 3 1 5 1 6 1
1 1 2 1 3 1 4 1 6 1
1 1 2 1 3 1 4 1 5 1 7 1 8 1 9 1 10 1 11 1
6 1 8 1 9 1 10 1 11 1
16 1 1 1 6 1 7 1 9 2 10 1 11 1 14 1 15 1
16 1 1 1 3 1 6 1 7 1 8 2 10 1 11 1 12 1 13 1 14 1 15 1
17 1 6 1 7 1 8 1 9 1 11 2 12 1
17 1 6 1 7 1 8 1 9 1 10 2 12 1
17 1 3 1 9 1 10 1 11 1 13 1
3 1 9 1 12 1
16 1 1 1 8 1 9 1 15 1
16 1 1 1 8 1 9 1 14 1
1 1 8 1 9 1 14 1 15 1
10 1 11 1 12 1
Clustering using Graclus
Now using the graph clustering tool – graclus, we cluster the objects (1) to (17). To do so, we execute the command:
where graclus is the executable tool, graph input file is the plain text graph file having lines equal to the number of objects+1, and m denotes the number of clusters preferred by the user.
Here we use input.graph as the input graph file having entries in 18 lines (one for each object from (1) to (17) and one additional line at the beginning having the statistics of the graph: number of nodes, number of edges and type of the input graph. We choose m=4, that is number is clusters preferred is 4. Thus we execute the following command to get the cluster IDs for the objects (1) – (17).
$ graclus input.graph
This command performs the fast graph clustering with optimal parameters suggested for normalized cuts and ratio association on the input.graph. Once the clustering is finished, graclus generates an output file having m cluster IDs, one per line in the specific output file with the file name: input.graph.part.4. This means that the file contain cluster IDs for the objects: (1) – (17). The Clusters IDs will be from 0 to m-1, where 0 refers to the cluster 1 and 1 refers to cluster 2, and so on.
Processing the output file
Now let us take the output file: input.graph.part.4. Here we perform clustering separately for graph with unit / distinct edge weights respectively. The number of lines in this output file is same as the number of objects considered for clustering. In the output file, the first entry belongs to the cluster to which the first term is belonging to. For e.g. the first entry is 1 implies that the first term 'Human' belongs to cluster 2 and the 12th entry contains 3 which implies that the term 'user' belongs to cluster 4. Thus number of entries in the output file is equal to the number of unique terms. We can group the terms by referring its corresponding cluster IDs from the output file. The following clusters show the terms grouped with respect to the graclus output for graph with unit edge weight.
1: [machine, interface, for, Lab, computer]
2: [Human, of, system, engineering, testing, EPS]
3: [opinion, response, time, management]
4: [User, perceived]
When we use distinct edge weights, as it can be seen from the above clusters, the clusters formed with the associated terms is more coherent and meaningful. This indirectly refers to the purity of the clusters. The clustering with unit edge weights are also equally useful keeping in view of the reduced time and computational efforts. However, the effective clustering is done by the suitable choice of the objective function with the selected normalized cut and ratio association factors.
Graclus tool can be used in effective feature clustering to improve text classification, to improve the efficiency of Image segmentation, to find associated terms for understanding user contexts in Information retrieval, data mining, object/pattern recognition and many other interdisciplinary fields associated with machine learning techniques.