Date of Award

6-2004

Degree Name

Doctor of Philosophy

Department

Statistics

First Advisor

Dr. Daniel P. Mihalko

Second Advisor

Dr. Magdalena Niewiadomska-Bugaj

Third Advisor

Dr. Michael R. Stoline

Fourth Advisor

Dr. Jung Chao Wang

Abstract

This study discusses the relationship between similarity measures that quantify the agreement between two clusterings of the same data set. It identifies a family [Special characters omitted.] of similarity measures that share a special form, attain a maximum value of 1, and become identical when corrected for chance agreement. In particular, this study proves that the similarity measures of Rand (R), Hubert (H), and Czekanowski (CZ) are identical when corrected for chance agreement. It also proves that the measures of McConnaughey (MC) and Kulczynski (K) are identical when corrected for chance agreement. Moreover, if the two clustering algorithms produce the same number of clusters with equal cluster sizes, then all the similarity measures in the family [Special characters omitted.] are identical when corrected for chance agreement.
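The equivalence described above can be checked numerically. The following Python sketch is an illustration only, not the dissertation's own code. It assumes the usual pair-counting definitions of the five indices, the correction (S - E[S]) / (1 - E[S]), and the expectation under fixed marginal totals and independence, in which the expected number of object pairs placed together by both clusterings is p*q / C(n, 2). The function names and the example labelings are hypothetical.

```python
from collections import Counter
from math import comb, isclose


def pair_counts(labels_a, labels_b):
    """Return (a, p, q, total) pair counts for two labelings of the same objects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same objects")
    joint = Counter(zip(labels_a, labels_b))
    a = sum(comb(m, 2) for m in joint.values())              # pairs together in both
    p = sum(comb(m, 2) for m in Counter(labels_a).values())  # pairs together in the first
    q = sum(comb(m, 2) for m in Counter(labels_b).values())  # pairs together in the second
    return a, p, q, comb(len(labels_a), 2)


def indices(a, p, q, total):
    """Pair-counting indices; each is linear in `a` once p, q, total are fixed.

    Assumes p > 0 and q > 0 (at least one cluster with two or more objects
    in each clustering); otherwise some ratios are undefined.
    """
    rand = (total + 2 * a - p - q) / total
    return {
        "R": rand,                              # Rand
        "H": 2 * rand - 1,                      # Hubert, taken as agreements minus disagreements
        "CZ": 2 * a / (p + q),                  # Czekanowski
        "MC": (a * (p + q) - p * q) / (p * q),  # McConnaughey
        "K": 0.5 * (a / p + a / q),             # Kulczynski
    }


def corrected_indices(labels_a, labels_b):
    """Apply (S - E[S]) / (1 - E[S]) to every index."""
    a, p, q, total = pair_counts(labels_a, labels_b)
    observed = indices(a, p, q, total)
    # Because each index is linear in `a`, its expectation is the index
    # evaluated at E[a] = p * q / total (assumed fixed-marginals model).
    expected = indices(p * q / total, p, q, total)
    return {name: (observed[name] - expected[name]) / (1 - expected[name])
            for name in observed}


# Hypothetical labelings of ten objects with unequal cluster sizes.
first = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
second = [0, 0, 1, 1, 1, 1, 1, 2, 0, 2]
result = corrected_indices(first, second)
for name, value in result.items():
    print(name, value)
print("R and CZ agree:", isclose(result["R"], result["CZ"], rel_tol=1e-9))
print("MC and K agree:", isclose(result["MC"], result["K"], rel_tol=1e-9))
```

With unequal cluster sizes the corrected values fall into two groups (R, H, CZ and MC, K). When both clusterings have the same cluster sizes, p equals q and the two groups coincide in this sketch.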

Fowlkes and Mallows (FM) derived the mean and variance of their measure and that of R under the assumptions of fixed marginal totals and independence of the two sets of clusterings. This study provides a derivation of the mean and variance of members of the [Special characters omitted.] family ([Special characters omitted.] ) under fixed marginal totals and independence of the two sets of clusterings. A simulation study shows that not only is R corrected for chance agreement recommended for use in cluster structure recovery, but the corrected FM and Wallace (W) measures can do as well, and that in general any measure should be corrected for chance agreement. This study also shows that, for large sample sizes, the difference between the corrected measures using the expectation proposed by Morey and Agresti (1984), which is based on an asymptotic multinomial distribution, and that of Hubert and Arabie (1985), which is based on an exact hypergeometric distribution, is negligible.
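The large-sample comparison can be illustrated with a small Python sketch. This is a rough illustration under stated assumptions, not the derivation used in the dissertation. The Rand index is written as 1 + (sum of n_ij^2 - (sum of n_i.^2 + sum of n_.j^2) / 2) / C(n, 2). The "hypergeometric" branch uses E[sum n_ij^2] = 2*p*q / C(n, 2) + n under fixed marginal totals. The "multinomial_plugin" branch is an assumed stand-in for a multinomial-style expectation that replaces each n_ij by n_i. * n_.j / n before squaring. The function name, the model labels, and the cluster sizes are hypothetical.

```python
from math import comb


def expected_rand(row_sizes, col_sizes, model):
    """Expected Rand index for given cluster sizes under one of two assumed models."""
    n = sum(row_sizes)
    if n != sum(col_sizes):
        raise ValueError("both clusterings must cover the same number of objects")
    total = comb(n, 2)
    sum_sq_rows = sum(m * m for m in row_sizes)
    sum_sq_cols = sum(m * m for m in col_sizes)
    if model == "hypergeometric":
        # Exact expectation of sum(n_ij^2) with fixed marginal totals.
        p = sum(comb(m, 2) for m in row_sizes)
        q = sum(comb(m, 2) for m in col_sizes)
        expected_sum_sq = 2 * p * q / total + n
    elif model == "multinomial_plugin":
        # Assumed plug-in form: square the expected cell counts n_i. * n_.j / n.
        expected_sum_sq = sum_sq_rows * sum_sq_cols / n ** 2
    else:
        raise ValueError("unknown model: " + model)
    return 1 + (expected_sum_sq - (sum_sq_rows + sum_sq_cols) / 2) / total


# Keep the cluster proportions fixed and grow the sample size.
scales = [1, 10, 100, 1000]
gaps = []
for k in scales:
    rows = [3 * k, 3 * k, 4 * k]
    cols = [3 * k, 5 * k, 2 * k]
    gap = abs(expected_rand(rows, cols, "hypergeometric")
              - expected_rand(rows, cols, "multinomial_plugin"))
    gaps.append(gap)
    print(10 * k, gap)
```

In this sketch the gap between the two expectations shrinks roughly in proportion to 1/n, which is in line with the statement that the choice of expectation matters little for large samples.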

Finally, a method for determining the number of clusters in a given data set is investigated through simulations, and its performance is compared with that of some existing methods. Two real data examples, namely protein consumption in 25 European countries and birth and death rates for 74 countries in 1974, are discussed to show the effectiveness of the proposed method.

Comments

5th Advisor: Dr. Robert J. Buck

Access Setting

Dissertation-Open Access
