A Critical Comparison of Answering Behavior Threshold Determination Methods as an Indicator of Engagement on the 2015 PISA Science Items

Date of Award

12-2023

Degree Name

Doctor of Philosophy

Department

Science Education

First Advisor

Betty AJ Adams, Ph.D.

Second Advisor

Charles R Henderson, Ph.D.

Third Advisor

Gary J Miron, Ph.D.

Abstract

The validity of low-stakes achievement tests is questionable when disengaged answering (i.e., rapid guessing and rapid omitting) is present in even moderate amounts. Identifying disengaged test-takers is therefore important so that statistical corrections or behavior mediation practices can be put into place. The first step in identifying disengaged examinees is to determine the divide between engaged examinees and those who are disengaged. Many methods exist for discriminating disengaged from engaged answering behavior on computer-based assessments. Some require expertise in statistical methods, while others require little to no mathematical knowledge. However, these simpler methods are routinely used for little reason other than convenience, and there has been no comprehensive comparison of the simple, commonly used threshold determination methods. Furthermore, few studies examine threshold determination methods across different item types or language groups. Therefore, this research presents an empirical comparison of thirteen simple, commonly used threshold methods across three item types and fifteen countries, comprising tests administered in eight languages. The assessment data were the 183 science items from the 2015 PISA. The efficacy of the threshold determination methods was evaluated using the criteria proposed by Wise (2019), with additional statistical examination of the mean threshold values for each method. Results indicate that no single threshold determination method produces an effective threshold when the answering data are aggregated by language and item type. Additionally, although still important, country and language group contribute less variation to the resultant threshold values for any method than does item type.
Furthermore, the threshold determination methods that use more information in their calculations (i.e., response time distributions and accuracy) are more efficacious than those that use little or none (i.e., fixed thresholds or percentages of the average response times). Even so, it is not clear which method is most appropriate for any given situation; each testing situation should be considered individually, with attention to item type and country/language characteristics, before choosing a threshold determination method. Two methods, the Proportion Correct >0% Sustained and the Change in Information with Accuracy, rate highly on four of the five evaluation criteria but fall short in that they do not produce a threshold value for every item in the assessment. This research therefore suggests that a hybrid method might be developed, combining one of these methods with one, such as a normative threshold, that produces a threshold for every test item but may be less reliable according to the criteria used to measure a threshold's effectiveness.
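To make the idea of a simple threshold determination method concrete, the sketch below illustrates a normative threshold of the kind mentioned above. It assumes the common 10%-of-mean-response-time rule capped at 10 seconds (the NT10 variant described by Wise and colleagues); the function names and response-time data are invented for illustration and are not taken from this dissertation.

```python
# Hypothetical sketch of a normative threshold method (assumed NT10 rule:
# 10% of the mean item response time, capped at 10 seconds).
# All data below are made up for illustration.

def nt10_threshold(response_times, cap=10.0):
    """Return a rapid-guessing threshold for one item: 10% of the
    mean response time, but never more than `cap` seconds."""
    mean_rt = sum(response_times) / len(response_times)
    return min(0.10 * mean_rt, cap)

def flag_rapid_responses(response_times, threshold):
    """Flag responses faster than the threshold as rapid (disengaged)."""
    return [rt < threshold for rt in response_times]

# Invented response times (seconds) for a single item:
rts = [42.0, 3.1, 55.4, 2.0, 61.7, 48.3, 1.5, 50.9]
thr = nt10_threshold(rts)            # 10% of the mean RT, at most 10 s
flags = flag_rapid_responses(rts, thr)
```

A method like this always yields a threshold for every item, which is why the abstract suggests pairing such a method with one of the better-performing but less complete methods.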

Access Setting

Dissertation-Abstract Only

Restricted to Campus until

12-1-2025
