Differences between revisions 3 and 6 (spanning 3 versions)

Back to ComputerTerms, InformationRetrieval

Cluster analysis is a statistical technique used to generate a category structure.The groups which are formed should have a high degree of association between members of hte same group and a low degree between members of different groups.

Similarity Measures:

               2C
 S         = -------
  (Di,dj)     A + B 

   Where C is the number of terms that Di and Dj have in common, 
   and A and B are the number of termsin Di and Dj

Similarity Matrix calculates a similarity measure between document x and y

  | S21                 |
  | S31  S32            |
  |  ...                |
  | SN1  SN2 ...SN(N-1) |

Clustering Methods

Nonhierarchical methods (partitioning methods) divide the data set of N objects into M clusters, where no overlap is allowed. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)

Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The agglomerative method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html

The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.

Single Link: Joins the most similar pair of objects that are not yet in the same cluster
Complete Link: Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.
Group Avaerage Link: Uses average values of pairwise links within a cluster to determine similarity.
Ward's Method: also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.

NonHierarchical Methods

Used in early research, but now considered goash.

single pass method

Assign the first document D1 as the representative for C1.
For Di, caclulate the similarity S with the representative for each existing cluster.
If Smax is greater than a threshold value Sr, add the item to the corresponding cluster and recaclulate teh cluster representative; otherwise, use Di to initiate a new cluster.
If an item Di remains to be clustered, return to step 2.

Reallocation Methods

Select M cluster representatives or centroids
For i=1 to N, assign Di to the most similar centroid
For j=1 to M, recaclulate the cluster centroid Cj
Repeat 2-3 until there is little or no change in cluster membership during a pass through the file.

Back to ComputerTerms, InformationRetrieval

-  ⇤ ← Revision 3 as of 2004-04-09 15:20:02 → 
  Size: 755
  Editor: yakko
  Comment:
+   ← Revision 6 as of 2004-04-09 16:15:14 → ⇥
  Size: 3180
  Editor: yakko
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 26:
+= Clustering Methods =

'''Nonhierarchical methods (partitioning methods)''' divide the data set of N objects into M clusters, where ''no overlap is allowed''. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM) 

'''Hierarchical methods''' produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The ''agglomerative'' method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is
http://www-csli.stanford.edu/~schuetze/completelink.html

The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.

   * '''Single Link''': Joins the most similar pair of objects that are not yet in the same cluster
   * '''Complete Link''': Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.
   * '''Group Avaerage Link''': Uses average values of pairwise links within a cluster to determine similarity.
   * '''Ward's Method''': also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.



== NonHierarchical Methods ==

Used in early research, but now considered goash.

=== single pass method ===

   1. Assign the first document D1 as the representative for C1.
   1. For Di, caclulate the similarity S with the representative for each existing cluster.
   1. If Smax is greater than a threshold value Sr, add the item to the corresponding cluster and recaclulate teh cluster representative; otherwise, use Di to initiate a new cluster.
   1. If an item Di remains to be clustered, return to step 2.

=== Reallocation Methods ===

   1. Select M cluster representatives or centroids
   1. For i=1 to N, assign Di to the most similar centroid
   1. For j=1 to M, recaclulate the cluster centroid Cj
   1. Repeat 2-3 until there is little or no change in cluster membership during a pass through the file.

Diff for "ClusteringAlgorithms"

Clustering Methods

NonHierarchical Methods

single pass method

Reallocation Methods