Date of Award
6-2004
Degree Name
Doctor of Philosophy
Department
Statistics
First Advisor
Dr. Daniel P. Mihalko
Second Advisor
Dr. Magdalena Niewiadomska-Bugaj
Third Advisor
Dr. Michael R. Stoline
Fourth Advisor
Dr. Jung Chao Wang
Abstract
This study discusses the relationship between similarity measures that quantify the agreement between two clusterings of the same data set. It identifies a family [Special characters omitted.] of similarity measures that share a special form, attain a maximum value of 1, and become identical when corrected for chance agreement. In particular, the study proves that the similarity measures of Rand (R), Hubert (H), and Czekanowski (CZ) are identical when corrected for chance agreement. It also proves that the measures of McConnaughey (MC) and Kulczynski (K) are identical when corrected for chance agreement. Moreover, if each clustering algorithm produces the same number of clusters and the cluster sizes are equal, then all the similarity measures in the family [Special characters omitted.] are identical when corrected for chance agreement.
Fowlkes and Mallows (FM) derived the mean and variance of their measure and of R under the assumptions of fixed marginal totals and independence of the two sets of clusterings. This study provides a derivation of the mean and variance of members of the [Special characters omitted.] family ([Special characters omitted.]) under the same assumptions. A simulation study shows that not only is the chance-corrected R recommended for clustering structure recovery, but the corrected FM and Wallace (W) measures can perform as well, and that in general any measure should be corrected for chance agreement. The study also shows that, for large sample sizes, the difference between the corrected measures computed with the expectations proposed by Morey and Agresti (1984), which are based on an asymptotic multinomial distribution, and those of Hubert and Arabie (1985), which are based on an exact hypergeometric distribution, is negligible.
Finally, a method for determining the number of clusters in a given data set will be investigated through simulations and its performance will be compared to some existing methods. Two real data examples, namely the protein consumption in 25 European countries and the birth and death rates for 74 countries in 1974, will be discussed to show the effectiveness of the proposed method.
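The chance-correction identities stated in the abstract can be checked numerically. The sketch below is an illustration only, not code from the dissertation. It assumes the standard pair-counting definitions of R, H, CZ, MC, and K (which this page does not state), the usual correction (S - E[S]) / (1 - E[S]), and the expectation E[a] = PQ/M for the number of co-clustered pairs under fixed marginal totals and independence. Because each of these five measures is linear in a once the marginals are fixed, E[S] is obtained by evaluating the measure at E[a]. The names `pair_counts`, `MEASURES`, and `corrected` are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(u, v):
    """Return (a, P, Q, M) for two label sequences over the same objects.

    a: pairs placed together by both clusterings
    P: pairs placed together by the first clustering
    Q: pairs placed together by the second clustering
    M: total number of pairs
    """
    if len(u) != len(v):
        raise ValueError("both clusterings must label the same objects")
    a = sum(comb(c, 2) for c in Counter(zip(u, v)).values())
    P = sum(comb(c, 2) for c in Counter(u).values())
    Q = sum(comb(c, 2) for c in Counter(v).values())
    M = comb(len(u), 2)
    return a, P, Q, M


# Assumed standard definitions, rewritten in terms of a and the fixed
# quantities P, Q, M (so every measure is linear in a).
# MC and K are undefined when P or Q is 0 (a clustering of singletons).
MEASURES = {
    "R": lambda a, P, Q, M: (M - P - Q + 2 * a) / M,
    "H": lambda a, P, Q, M: (M - 2 * P - 2 * Q + 4 * a) / M,
    "CZ": lambda a, P, Q, M: 2 * a / (P + Q),
    "MC": lambda a, P, Q, M: a * (P + Q) / (P * Q) - 1,
    "K": lambda a, P, Q, M: a * (P + Q) / (2 * P * Q),
}


def corrected(u, v):
    """Chance-corrected value (S - E[S]) / (1 - E[S]) for each measure."""
    a, P, Q, M = pair_counts(u, v)
    expected_a = P * Q / M  # assumed hypergeometric expectation of a
    result = {}
    for name, measure in MEASURES.items():
        s = measure(a, P, Q, M)
        e = measure(expected_a, P, Q, M)  # valid because measure is linear in a
        result[name] = (s - e) / (1 - e)
    return result


if __name__ == "__main__":
    u = [0, 0, 0, 0, 1, 1, 2, 2]
    v = [0, 0, 1, 1, 1, 2, 2, 2]
    for name, value in corrected(u, v).items():
        print(f"{name}: {value:.6f}")
```

For the two example labelings, the corrected R, H, and CZ coincide, and the corrected MC and K coincide, while the two groups differ slightly from each other because the marginal pair totals P and Q are unequal.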
Access Setting
Dissertation-Open Access
Recommended Citation
Albatineh, Ahmed Najeeb Khalaf, "On Similarity Measures for Cluster Analysis" (2004). Dissertations. 1082.
https://scholarworks.wmich.edu/dissertations/1082
Comments
5th Advisor: Dr. Robert J. Buck