Differences between revisions 3 and 5 (spanning 2 versions)
Revision 3 as of 2004-04-09 15:20:02
Size: 755
Editor: yakko
Comment:
Revision 5 as of 2004-04-09 16:01:26
Size: 2390
Editor: yakko
Comment:
Deletions are marked like this. Additions are marked like this.
Line 26: Line 26:
== Clustering Methods ==

'''Nonhierarchical methods (partitioning methods)''' divide the data set of N objects into M clusters, where ''no overlap is allowed''. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)

'''Hierarchical methods''' produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The ''agglomerative'' method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is
http://www-csli.stanford.edu/~schuetze/completelink.html

The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.

   * '''Single Link''': Joins the most similar pair of objects that are not yet in the same cluster
   * '''Complete Link''': Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.
   * '''Group Avaerage Link''': Uses average values of pairwise links within a cluster to determine similarity.
   * '''Ward's Method''': also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.

=== NonHierarchical single pass method ===


Back to ComputerTerms, InformationRetrieval

Cluster analysis is a statistical technique used to generate a category structure.The groups which are formed should have a high degree of association between members of hte same group and a low degree between members of different groups.

Similarity Measures:

               2C
 S         = -------
  (Di,dj)     A + B 

   Where C is the number of terms that Di and Dj have in common, 
   and A and B are the number of termsin Di and Dj

Similarity Matrix calculates a similarity measure between document x and y

  | S21                 |
  | S31  S32            |
  |  ...                |
  | SN1  SN2 ...SN(N-1) |

Clustering Methods

Nonhierarchical methods (partitioning methods) divide the data set of N objects into M clusters, where no overlap is allowed. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)

Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The agglomerative method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html

The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.

  • Single Link: Joins the most similar pair of objects that are not yet in the same cluster

  • Complete Link: Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.

  • Group Avaerage Link: Uses average values of pairwise links within a cluster to determine similarity.

  • Ward's Method: also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.

NonHierarchical single pass method

Back to ComputerTerms, InformationRetrieval

ClusteringAlgorithms (last edited 2004-04-09 16:40:59 by yakko)