Back to ComputerTerms, InformationRetrieval
Cluster analysis is a statistical technique used to generate a category structure.The groups which are formed should have a high degree of association between members of hte same group and a low degree between members of different groups.
Similarity Measures:
2C S = ------- (Di,dj) A + B Where C is the number of terms that Di and Dj have in common, and A and B are the number of termsin Di and Dj
Similarity Matrix calculates a similarity measure between document x and y
| S21 | | S31 S32 | | ... | | SN1 SN2 ...SN(N-1) |
Clustering Methods
Nonhierarchical methods (partitioning methods) divide the data set of N objects into M clusters, where no overlap is allowed. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)
Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The agglomerative method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html
The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.
Single Link: Joins the most similar pair of objects that are not yet in the same cluster
Complete Link: Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.
Group Avaerage Link: Uses average values of pairwise links within a cluster to determine similarity.
Ward's Method: also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.
NonHierarchical single pass method
Back to ComputerTerms, InformationRetrieval