Size: 755
Comment:
|
Size: 3180
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 26: | Line 26: |
= Clustering Methods = '''Nonhierarchical methods (partitioning methods)''' divide the data set of N objects into M clusters, where ''no overlap is allowed''. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM) '''Hierarchical methods''' produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The ''agglomerative'' method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster. * '''Single Link''': Joins the most similar pair of objects that are not yet in the same cluster * '''Complete Link''': Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters. * '''Group Avaerage Link''': Uses average values of pairwise links within a cluster to determine similarity. * '''Ward's Method''': also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids. == NonHierarchical Methods == Used in early research, but now considered goash. === single pass method === 1. Assign the first document D1 as the representative for C1. 1. For Di, caclulate the similarity S with the representative for each existing cluster. 1. If Smax is greater than a threshold value Sr, add the item to the corresponding cluster and recaclulate teh cluster representative; otherwise, use Di to initiate a new cluster. 1. If an item Di remains to be clustered, return to step 2. === Reallocation Methods === 1. Select M cluster representatives or centroids 1. For i=1 to N, assign Di to the most similar centroid 1. For j=1 to M, recaclulate the cluster centroid Cj 1. Repeat 2-3 until there is little or no change in cluster membership during a pass through the file. |
Back to ComputerTerms, InformationRetrieval
Cluster analysis is a statistical technique used to generate a category structure.The groups which are formed should have a high degree of association between members of hte same group and a low degree between members of different groups.
Similarity Measures:
2C S = ------- (Di,dj) A + B Where C is the number of terms that Di and Dj have in common, and A and B are the number of termsin Di and Dj
Similarity Matrix calculates a similarity measure between document x and y
| S21 | | S31 S32 | | ... | | SN1 SN2 ...SN(N-1) |
Clustering Methods
Nonhierarchical methods (partitioning methods) divide the data set of N objects into M clusters, where no overlap is allowed. Since discovering the optimal grouping of N objects into M sets is Combinations(n|M)=N!/((N-M)!M!) is impossible for large sets, so we use heuristics. Computational requirements are then usually O(NM)
Hierarchical methods produce a nested data set in which pairs of items or clusters are successively linked until every item in the data set is connected. The agglomerative method performs in N - 1 pairwise joins beginning from an unclustered dataset. A good resource on this is http://www-csli.stanford.edu/~schuetze/completelink.html
The procedure is to start off with each object in its own cluster. We then start linking objects together using a link method (described below) until all the objects have been merged into a single heirarchically organized cluster.
Single Link: Joins the most similar pair of objects that are not yet in the same cluster
Complete Link: Uses the least similar pairs between two clusters to determine the intercluster similarity - this results in small tightly bound clusters.
Group Avaerage Link: Uses average values of pairwise links within a cluster to determine similarity.
Ward's Method: also known as the minimum variance method because it joins at each stage teh cluster pair whose merger minimizes the increase in the total within-group error sum of squares based on the Euclidean distance between centroids.
NonHierarchical Methods
Used in early research, but now considered goash.
single pass method
- Assign the first document D1 as the representative for C1.
- For Di, caclulate the similarity S with the representative for each existing cluster.
- If Smax is greater than a threshold value Sr, add the item to the corresponding cluster and recaclulate teh cluster representative; otherwise, use Di to initiate a new cluster.
- If an item Di remains to be clustered, return to step 2.
Reallocation Methods
- Select M cluster representatives or centroids
- For i=1 to N, assign Di to the most similar centroid
- For j=1 to M, recaclulate the cluster centroid Cj
- Repeat 2-3 until there is little or no change in cluster membership during a pass through the file.
Back to ComputerTerms, InformationRetrieval