Evaluation of the Indonesian Scholastic Aptitude Test According to the Rasch Model and Its Paradigm

Asrijanty Asril
Bachelor in Psychology (Gadjah Mada University, Indonesia)
Master in Social Research and Evaluation (Murdoch University, Australia)

This thesis is presented for the degree of Doctor of Philosophy of The University of Western Australia
Graduate School of Education
2011
Abstract

This study evaluates a high-stakes test, the Indonesian Scholastic Aptitude Test (ISAT), from the perspectives of the Rasch model and its paradigm. The test was developed by the Center for Educational Assessment (CEA) in Jakarta and has been used as one of the admission tests for undergraduate and postgraduate study in some public universities in Indonesia. The CEA has formed a bank of items from which different sets of items are constructed for different purposes. For this study, data from two different sets of items from the item bank were available for analysis: one administered to students for undergraduate entry and one for postgraduate entry. Each test consists of three subtests, called Verbal, Quantitative, and Reasoning, to reflect the capacities they are intended to assess.

Firstly, this study examines the internal structure of the subtests by applying the Rasch model and its paradigm. Secondly, it examines the stability of item bank parameters for the items of the subtests. Thirdly, it examines the predictive validity of the test.

The Rasch model can be applied primarily as a statistical model used to model data. However, its use in this thesis goes beyond this narrow focus: the Rasch paradigm is used as a framework for the whole study. The case for the model is that the comparisons among persons are invariant with respect to which items are used from a class of relevant items, and that the comparisons among items are invariant with respect to the class of persons. These invariance properties are independent of any particular data set. They are especially important when not all persons can attempt the same items on every occasion, which occurs, for example, when item banks are used. However, data will have these invariance properties only if they fit the model. It follows that data are examined for fit to the model, and that if data do not fit the model, it is the data that need to be examined and a substantive explanation for the misfit sought. The purpose of the examination is a better understanding of the design of the instrument and of the variable and context of measurement. It is this perspective that involves the broader Rasch paradigm, not merely the application of the model. In this paradigm, validity, reliability and fit of the data to the model are integrated.

In this study the test is examined not only according to the Rasch model but also according to the Rasch paradigm. Accordingly, the aspects examined in addition to the fit of data to the Rasch model were factors that may affect the validity of responses and inferences, including the accuracy of person and item estimates. General fit to the model included standard checks on evidence of (i) violation of local independence, (ii) differential item functioning, (iii) unidimensionality, and (iv) reliability based on Rasch estimates, which also provided evidence of the power to detect misfit. Less standard aspects included checks on (i) the effects of missing responses, (ii) item difficulty order in relation to the item order in the tests, (iii) targeting of the person and item distributions, (iv) possible information in distractors of multiple choice items, (v) the presence of guessing and how to account for it, using recent contributions to the study of guessing with the Rasch model, (vi) differences in units of measurement in the item bank and in the analyses, and (vii) the comparison of item difficulties from the item bank and from the analyses. Examining the data from these perspectives gave a comprehensive understanding of the data and the frame of reference.

Data for this study consisted of the responses of 440 postgraduate examinees and responses of 833 undergraduate examinees.
All items were multiple choice items with five alternatives, one of which was the correct response. For the analysis of the fit of data to the Rasch model, all these data were analysed. However, for the analysis of predictive validity, data for only 327 postgraduate examinees and 177 undergraduate examinees were examined. These examinees had been accepted into a university program and academic performance records for them were available. The undergraduate examinees were enrolled in Economics and Engineering. The postgraduate examinees were enrolled in Life Science, Economics, Law, Literature, Natural Science, Medicine, Psychology and Social Studies. For purposes of predictive validity, the grade point average (GPA) in the first two years of study was used as the criterion.

The findings show that, in all data sets, three different ways of scoring missing responses did not have a significant effect on reliability and item fit. Therefore, missing responses in all data were scored as incorrect responses. This is consistent with how the responses were scored in the selection situation. This scoring also resulted in a data set with no missing responses, which has some advantages in this analysis.

It is shown that, in general, the items in the test booklet were arranged according to their difficulty in the item bank. However, the difficulties obtained from the data analysed did not follow the same order as in the test booklet. Despite this inconsistency, it was inferred that the ordering of items did not have an impact on the validity and reliability of the test, because missing responses had no impact on fit and reliability.

The analyses showed that, in general, the internal structure of the undergraduate and postgraduate tests was reasonably consistent with the Rasch model. The items were relatively well targeted and had reasonable power, indicated by the reliability index, to disclose misfit and to differentiate examinees. In all subtests of the ISAT for both the postgraduate and undergraduate tests, there was some misfit to the model. However, because misfit was observed in only a few items in each subtest, its effect on reliability was small. The analyses also showed that low or high discrimination, guessing and DIF were evident in some items. Some local dependence, due to the structure of the items, was also evident in all subtests. Dependence between specific items which was not directly a result of the structure of the test was observed only between two items in the Quantitative undergraduate set. Information in a distractor was also found in some items. In each case where an item showed misfit or rescoring was suggested by the statistical analysis, a substantive explanation was sought and provided.

Item parameter estimates from the analysis of the postgraduate and undergraduate tests were compared with item parameters from the item bank at the CEA, and considerable differences were found. However, when the standard deviations of the same items in the item bank and in the data analysed were used to assess the relative units in the two contexts, little difference was found between the units in the item bank and in the data analysed. Despite differences in the estimates of the individual item parameters, the person estimates were virtually the same whether item bank parameters or parameters from the analysis of the postgraduate/undergraduate test data were used. This is partly because of each of the following: (i) the arbitrary origin was adjusted by making the mean difficulty of the items from the item bank zero, as in the data analysed, (ii) all students had responses to all items, (iii) the total score in the Rasch model is a sufficient statistic for the person parameter estimate, and (iv) the units were virtually the same. The differences in the relative item difficulties from those of the item bank suggest that the frames of reference of the original application and the new application are not exactly the same. Further study to understand the instability, and regular checks of the stability of item bank parameters, need to be performed.
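The reasoning in the preceding paragraph can be illustrated with a minimal sketch. It is not the analysis carried out in the thesis: the item locations below are hypothetical, the function names are introduced here for illustration only, and a plain maximum likelihood estimate stands in for whatever estimation method the software used in the thesis applies. The sketch recentres two sets of locations for the same items to a mean of zero, uses the ratio of their standard deviations as a rough index of relative units, and estimates the person location from the total score alone.

```python
import math
from statistics import mean, pstdev


def rasch_probability(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))


def recentre(difficulties):
    """Shift item locations so that their mean is zero (the arbitrary origin)."""
    m = mean(difficulties)
    return [d - m for d in difficulties]


def unit_ratio(bank, observed):
    """Ratio of standard deviations, used here as a rough index of relative units."""
    return pstdev(observed) / pstdev(bank)


def person_estimate(total_score, difficulties, tol=1e-8, max_iter=100):
    """Maximum likelihood person location for a non-extreme total score.

    Solves sum_i P(beta, delta_i) = total_score by Newton-Raphson. Only the
    total score enters, because it is the sufficient statistic for beta.
    """
    n = len(difficulties)
    if not 0 < total_score < n:
        raise ValueError("the ML estimate is not finite for extreme scores")
    beta = math.log(total_score / (n - total_score))  # starting value
    for _ in range(max_iter):
        probs = [rasch_probability(beta, d) for d in difficulties]
        information = sum(p * (1.0 - p) for p in probs)
        step = (total_score - sum(probs)) / information
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical locations (in logits) of the same five items in two contexts.
bank_locations = [-1.2, -0.5, 0.1, 0.6, 1.5]
observed_locations = [-0.9, -0.8, 0.4, 0.2, 1.6]

bank = recentre(bank_locations)
observed = recentre(observed_locations)
ratio = unit_ratio(bank, observed)

print(f"unit ratio (observed / bank) = {ratio:.3f}")
for score in range(1, len(bank)):
    b_bank = person_estimate(score, bank)
    b_observed = person_estimate(score, observed)
    print(f"score={score}  bank={b_bank:.3f}  observed={b_observed:.3f}")
```

In this toy case the individual item locations differ between the two sets, yet the person estimates for each total score stay close, because the origin and the spread of the two sets are nearly the same and every person responds to every item.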
In terms of predictive validity, for the postgraduate data, a positive correlation between the GPA and the ISAT estimates was found for most fields of study. However, some correlations were relatively small and not statistically significant. Only in three fields of study (Literature, Social Studies and Psychology) was academic performance in the university, as indicated by the GPA, predicted by the ISAT estimates. The variance explained ranged from 11.9% to 94.2%. The Verbal subtest was a significant predictor in Literature, accounting for 31.4% of the variance, and the Reasoning subtest was a significant predictor in Social Studies, accounting for 11.9% of the variance. In Psychology, all the subtests were significant predictors, accounting jointly for 94.2% of the variance. Although there were only nine students in Psychology, the high predictive validity was considered worth reporting.

In both Economics and Engineering undergraduate studies, the GPA was significantly correlated with all the ISAT estimates. The correlation was consistently higher in Economics than in Engineering, despite the standard deviation of the GPA distribution being greater in Engineering than in Economics. When the three subtest estimates were included as predictors in a multiple regression analysis, the variance accounted for was 27.9% in Economics and 10.4% in Engineering. The Quantitative subtest predicted better than the other subtests in both Economics and Engineering.

That the positive and significant correlation between ISAT estimates and the GPA was small in some fields and not observed in other fields of study at the postgraduate level can perhaps be explained by the very small range of the GPA in the postgraduate data, especially in some fields such as Medicine. The standard deviation of the GPA in the postgraduate data was approximately half of the standard deviation in the undergraduate data. Therefore, as expected, the correlation between GPA and ISAT estimates was stronger in the undergraduate studies.
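The two cautions in the paragraphs above, a narrow GPA range and very small groups, can be made concrete with a small simulation and one line of arithmetic. This is an illustrative sketch only. The simulated data, the slope of 0.5, the seed and the truncation at the median are all assumptions introduced here, and the adjusted value is the standard adjusted R-squared formula applied to the reported 94.2% with nine students and three predictors. It is not a figure reported in the thesis.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sd(values):
    """Population standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjustment of R-squared for sample size and number of predictors."""
    return 1.0 - (1.0 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)


# Simulated (hypothetical) aptitude estimates and GPA with a moderate relation.
rng = random.Random(2011)
n = 5000
aptitude = [rng.gauss(0.0, 1.0) for _ in range(n)]
gpa = [0.5 * a + rng.gauss(0.0, 1.0) for a in aptitude]
r_full = pearson(aptitude, gpa)

# Restrict the GPA range by keeping only cases at or above the median GPA.
cutoff = sorted(gpa)[n // 2]
kept = [(a, g) for a, g in zip(aptitude, gpa) if g >= cutoff]
kept_aptitude = [a for a, _ in kept]
kept_gpa = [g for _, g in kept]
r_restricted = pearson(kept_aptitude, kept_gpa)

print(f"SD of GPA: full={sd(gpa):.3f}  restricted={sd(kept_gpa):.3f}")
print(f"correlation: full={r_full:.3f}  restricted={r_restricted:.3f}")
print(f"adjusted R^2 for R^2=0.942, n=9, 3 predictors: "
      f"{adjusted_r_squared(0.942, 9, 3):.4f}")
```

Restricting the criterion to a narrower range lowers the observed correlation even though the underlying relation is unchanged, which is the pattern the abstract offers as a possible explanation for the weaker postgraduate correlations.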
Another factor which needs to be taken into account in interpreting the results of the predictive validity analysis is that the sample size in each field of study, especially in the postgraduate data, was very small. This may lead to sampling errors and unstable estimates.

This study provides comprehensive evidence of the degree of the broadly defined reliability and validity of the ISAT. It shows that the ISAT met the basic criteria of the Rasch model and that it had some predictive validity in regard to academic performance in postgraduate and undergraduate studies, as assessed by correlations with the students' GPAs. However, it is necessary to consider further the implications of the differences between the relative difficulties in the item bank and those observed in the data analysed.

This study is significant in two ways. Firstly, it contributes to the specific item development process for the ISAT. The results of this study can be used to provide better items and a better test to measure the construct more validly, reliably and efficiently. Secondly, the study contributes to the field of measurement in general by illustrating an application of not only the Rasch model, but the Rasch paradigm, in constructing and evaluating a test. The differences between applying a measurement model within the Rasch paradigm and within a general item response theory (IRT) paradigm are demonstrated.
Declaration

In accordance with the regulations for presenting theses and other work in higher degrees, I hereby declare that this thesis is entirely my own work and that it has not been submitted for a degree at this or any other university. I have the permission of my co-author to include the work from the following publication in my thesis.

Asril, Asrijanty and Marais, Ida (2011). Applying a Rasch Model Distractor Analysis: Implication for Teaching Learning. In Robert Cavanagh and Russel F. Waugh (Eds), Application of Rasch Measurement for Learning Environments Research (pp ). The Netherlands: Sense publishers. ISBN: (paperback), (hardback).

Asrijanty Asril
The University of Western Australia
August 2011

Note. This thesis has been formatted in accordance with modified American Psychological Association (2010) publication guidelines.
Table of Contents

Abstract
Declaration
Table of Contents
Acknowledgements
List of Acronyms
List of Tables
List of Figures
List of Appendices

Chapter 1 Introduction
    Selection for Higher Education Studies
    The Indonesian Scholastic Aptitude Test (ISAT)
    Present Study
    Significance of the Study
    Overview of the Dissertation

Chapter 2 Literature Review
    Aptitude Testing for Selection
    The Rasch Model and Its Paradigm

Chapter 3 Methods
    Rationale and Procedure in Examining Internal Consistency
    Rationale and Procedure in Examining the Stability of Item Bank Parameters
    Rationale and Procedure in Examining Predictive Validity
    ISAT Items Analysed in this Study

Chapter 4 Internal Consistency Analysis of the Postgraduate Data
    Examinees of the Postgraduate Data
    Internal Consistency Analysis of the Verbal Subtest
    4.3 Internal Consistency Analysis of the Quantitative Subtest
    Internal Consistency Analysis of the Reasoning Subtest
    Summary of Internal Consistency Analysis of the Postgraduate Data

Chapter 5 Internal Consistency Analysis of the Undergraduate Data
    Examinees of the Undergraduate Data
    Treatment of Missing Responses and Item Difficulty Order for the Undergraduate Data
    Internal Consistency Analysis of the Verbal Subtest
    Internal Consistency Analysis of the Quantitative Subtest
    Internal Consistency Analysis of the Reasoning Subtest
    Summary of Internal Consistency Analysis of the Undergraduate Data

Chapter 6 Stability of the Item Bank Parameters in the Postgraduate and Undergraduate Data
    Correlations between Item Locations
    Comparisons between Item Locations
    The Effect of Unstable Item Parameters on Person Measurement
    Summary

Chapter 7 Predictive Validity of the ISAT for Postgraduate and Undergraduate Studies
    The Predictor and Criterion for the Predictive Validity Analysis
    Analysis of the Postgraduate Data
    Analysis of the Undergraduate Data
    Summary

Chapter 8 Discussion and Conclusion
    Discussion
    Conclusion

References
Appendices
Acknowledgements

I would like to express my gratitude to David Andrich for his guidance and continuous support. His understanding and generosity in guiding me made the journey of finishing this study rewarding and enjoyable. This study applies much of his work on the Rasch model.

I would also like to thank my co-supervisors, Ida Marais and Stephen Humphry, and to acknowledge their support and constructive input throughout the study. Our frequent discussions helped me gain more understanding of Rasch analysis. This study also applies their recent work on the Rasch model.

I would like to acknowledge and thank Irene Styles for reading my thesis. Her suggestions improved the final thesis.

The data I used in this study were obtained from the Center for Educational Assessment, Jakarta. I would like to thank N.Y. Wardani for granting me permission to use the data, and all my colleagues in the Center for their support, especially Mbak Tuti, Nana, Irma, Daru, and Yoyok for their assistance in preparing the data.

I would like to acknowledge and thank the Department of Education, Employment, and Workplace Relations (DEEWR) of Australia for providing financial support throughout my studies through the Endeavour Postgraduate Award.

Lastly, I would like to thank my family and friends for their support and encouragement. Special thanks go to Vitti for her assistance in editing my first draft and her support throughout.
List of Acronyms

CEA      Center for Educational Assessment
CCC      Category Characteristic Curve
CTT      Classical Test Theory
DIF      Differential Item Functioning
DRM      Dichotomous Rasch Model
GPA      Grade Point Average
ICC      Item Characteristic Curve
IRT      Item Response Theory
ISAT     Indonesian Scholastic Aptitude Test
PRM      Polytomous Rasch Model
PSI      Person Separation Index
SNMPTN   National Selection to Enter Public Universities
SPMB     Selection for Admission of New Students
TCC      Threshold Characteristic Curve
List of Tables

Table 1.1. ISAT Specifications
Table 2.1. Rasch's Two-way Frame of Reference of Objects, Agents and Responses
Table 3.1. Treatment of Missing Responses for Item Estimates
Table 4.1. Composition of Postgraduate Examinees
Table 4.2. The Effect of Different Treatments of Missing Responses in
Table 4.3. Fit Statistics of Misfitting Items for the Verbal Subtest
Table 4.4. Spread Value and the Minimum Value Indicating Dependence
Table 4.5. PSIs in Three Analyses to Confirm Dependence in Six Verbal Testlets
Table 4.6. Statistics of Some Verbal Items after Tailoring Procedure
Table 4.7. Results of Rescoring 17 Verbal Items
Table 4.8. Results of Rescoring Four Verbal Items
Table 4.9. Results of Rescoring Items 13 and
Table Problematic Items in the Verbal Subtest Postgraduate Data
Table The Effect of Different Treatments of Missing Responses in the Quantitative Subtest
Table Item Difficulty Order in the Quantitative Subtest
Table Spread Value and the Minimum Value in the Quantitative Subtest
Table PSIs in Three Analyses to Confirm Dependence
Table Statistics of Some Quantitative Items after Tailoring Procedure
Table Results of Rescoring for 22 Quantitative Items
Table Results of Rescoring Three Quantitative Items
Table Problematic Items in the Quantitative Subtest Postgraduate Data
Table The Effect of Different Treatments of Missing Responses in the Reasoning Subtest
Table Spread Value and the Minimum Value Indicating Dependence
Table PSIs in Three Analyses to Confirm Dependence
Table Statistics of Some Reasoning Items after Tailoring Procedure
Table Results of Rescoring 19 Reasoning Items
Table Result Rescoring for 6 Reasoning Items
Table Problematic Items in the Reasoning Subtest Postgraduate Data
Table 5.1. Composition of Undergraduate Examinees
Table 5.2. Problematic Items in the Verbal Subtest Undergraduate Data
Table 5.3. Problematic Items in the Quantitative Subtest Undergraduate Data
Table 5.4. Problematic Items in the Reasoning Subtest Undergraduate Data
Table 6.1. Correlations between Item Locations of the Item Bank and of the Postgraduate/Undergraduate Analyses
Table 6.2. Standard Deviation of the Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses
Table 6.3. Significance of the Difference in Variance of Item Locations from the Item Bank and from the Postgraduate/Undergraduate Analyses
Table 6.4. Identification of Unstable Items without Adjusting the Units for the Verbal Subtest Postgraduate Data
Table 6.5. Identification of Unstable Items with Adjusting the Units for the Verbal Subtest Postgraduate Data
Table 6.6. The Effect of Adjusting the Units as a Function of a Unit Ratio and Correlation between Item Locations of the Item Bank and Postgraduate/Undergraduate Analyses
Table 6.7. Comparisons of the Means of Person Locations Using Item Bank Values and Item Estimate from the Postgraduate/Undergraduate Analyses
Table 7.1. Number of Examinees who had Academic Records in Each Semester
Table 7.2. Descriptive Statistics of ISAT Location Estimates for all Postgraduate Examinees
Table 7.3. Descriptive Statistics of the ISAT and the GPA per Field of Study for the Postgraduate Data
Table 7.4. Summary of Correlations between Subtests
Table 7.5. Correlation between the ISAT and GPA in the Postgraduate Data
Table 7.6. Summary of Regression Analyses for the Postgraduate Data
Table 7.7. Descriptive Statistics of ISAT Location Estimates for All Undergraduate Examinees
Table 7.8. Descriptive Statistics of ISAT and GPA per Field of Study for the Undergraduate Data
Table 7.9. Summary of Correlation between Subtests
Table Correlation between the ISAT and GPA in the Undergraduate Data
Table Summary of Regression Analyses for the Undergraduate Data
List of Figures

Figure 1.1. ISAT development process
Figure 2.1. ICCs of three items with dichotomous responses
Figure 2.2. CCCs and TCCs of an item with three response categories
Figure 3.1. ICCs of two items indicating fit (left) and misfit (right)
Figure 3.2. Examples of items showing guessing (right) and no guessing (left)
Figure 3.3. ICCs of an item where guessing is confirmed, before tailoring (left) and after tailoring (right)
Figure 3.4. ICCs of an item where guessing is not confirmed, before tailoring (left) and after tailoring (right)
Figure 3.5. CCCs and TCCs for polytomous responses with three category responses
Figure 3.6. Plots of distractors with potential information
Figure 3.7. CCC (left) and TCC (right) of an item showing categories working as intended (top) and not working as intended (bottom)
Figure 3.8. An item showing uniform DIF
Figure 4.1. Item order of the Verbal subtest according to the location from the item bank (top panel) and from the postgraduate analysis (bottom panel)
Figure 4.2. Person-item location distribution for the Verbal subtest
Figure 4.3. The ICCs of items 18 and
Figure 4.4. The ICC of item 36 indicating guessing graphically
Figure 4.5. The plot of item locations from the tailored and anchored analyses for the Verbal subtest
Figure 4.6. ICCs for item 36 from the original analysis (left) and the anchored all analysis (right) to confirm guessing
Figure 4.7. Graphical fit for item
Figure 4.8. Graphical fit for item
Figure 4.9. Graphical fit for item
Figure Graphical fit for item
Figure The content of item
Figure Distractor plot of item
Figure The content of item
Figure Distractor plots of item
Figure The graphical fit for rescored item 13 into three categories
Figure The graphical fit for rescored item 36 into four categories
Figure The ICCs of Verbal items indicating DIF for gender, educational level and program of study
Figure ICCs for males and females for resolved item
Figure ICCs for Masters and doctorates for resolved item
Figure ICCs for social sciences and non-social sciences for resolved item
Figure Item order of the Quantitative subtest according to the location from the item bank (top) and from the postgraduate analysis (bottom)
Figure Person-item location distribution of the Quantitative subtest
Figure The ICC of item
Figure ICCs of four Quantitative items indicating guessing graphically
Figure The plot of tailored and anchored locations for the Quantitative subtest
Figure The ICCs from original analysis for four Quantitative items which indicate significant location difference between tailored and anchored analyses but did not indicate guessing from the ICC
Figure ICCs of four Quantitative items from the original analysis (left) and anchored all analysis (right) to confirm guessing
Figure The content of four Quantitative items indicating guessing
Figure Graphical fit of item
Figure Graphical fit of item
Figure Graphical fit of item
Figure Graphical fit for rescored item 58 only
Figure The content of item
Figure Distractor plots of item
Figure Reasoning item order according to item location from the item bank (top panel) and from postgraduate analysis (bottom panel)
Figure Person-item location distribution of the Reasoning subtest
Figure ICCs of four Reasoning items indicating guessing graphically
Figure The plot of item locations from the tailored and anchored analyses for the Reasoning subtest
Figure The ICC of item
Figure The ICCs of four Reasoning items from the original (left) and the anchored all analysis (right) to confirm guessing
Figure The content of items 96, 108, 109, and
Figure Graphical fit for item
Figure Graphical fit for item
Figure Graphical fit for item
Figure Content of items 92, 94, and
Figure Distractor plots of items 92, 94,
Figure 7.1. Distribution of ISAT location for admitted and non-admitted groups
Figure 7.2. Distribution of location estimate in Verbal for each field of study
Figure 7.3. Distribution of location estimate in Quantitative for each field of study
Figure 7.4. Distribution of location estimate in Reasoning for each field of study
Figure 7.5. Distribution of the location estimates in Total for each field of study
Figure 7.6. Distribution of the location estimates in GPA for each field of study
Figure 7.7. Distribution of ISAT location estimates for sample predictive validity group and other groups
Figure 7.8. Distribution of ISAT subtests location estimate and GPA for Economics and Engineering of undergraduate studies
List of Appendices

Appendix A1. Item Fit Statistics for Verbal (Postgraduate) Subtest
Appendix A2. Statistics of Verbal (Postgraduate) Items after Tailoring Procedure
Appendix A3. Results of DIF Analysis for Verbal (Postgraduate) Subtest
Appendix B1. Item Fit Statistics for Quantitative (Postgraduate) Subtest
Appendix B2. Statistics of Quantitative (Postgraduate) Items after Tailoring Procedure
Appendix B3. Results of DIF Analysis for Quantitative (Postgraduate) Subtest
Appendix C1. Item Fit Statistics Analysis for Reasoning (Postgraduate) Subtest
Appendix C2. Statistics of Reasoning (Postgraduate) Items after Tailoring Procedure
Appendix C3. Results of DIF Analysis for Reasoning (Postgraduate) Subtest
Appendix D1. Treatment of Missing Responses for Verbal Subtest in Undergraduate Data
Appendix D2. Item Difficulty Order for Verbal (Undergraduate) Subtest
Appendix D3. Targeting and Reliability for Verbal (Undergraduate) Subtest
Appendix D4. Item Fit Statistics for Verbal (Undergraduate) Subtest
Appendix D5. Local Independence in Verbal Subtest of Undergraduate Data
Appendix D6. Evidence of Guessing in Verbal Subtest of Undergraduate Data
Appendix D7. Distractor Information in Verbal Subtest of Undergraduate Data
Appendix D8. Results of DIF Analysis for Verbal (Undergraduate) Subtest
Appendix E1. Treatment of Missing Responses for Quantitative Subtest of Undergraduate Data
Appendix E2. Item Difficulty Order for Quantitative (Undergraduate) Subtest
Appendix E3. Targeting and Reliability for Quantitative (Undergraduate) Subtest
Appendix E4. Item Fit Statistics for Quantitative Subtest of Undergraduate Data
Appendix E5. Local Independence in Quantitative Subtest of Undergraduate Data
Appendix E6. Evidence of Guessing in Quantitative Subtest of Undergraduate Data
Appendix E7. Distractor Information for Quantitative Subtest of Undergraduate Data
Appendix E8. Results of DIF Analysis for Quantitative Subtest of Undergraduate Data
Appendix E9. Content of Problematic Items in Quantitative (Undergraduate) Subtest
Appendix F1. Treatment of Missing Responses for Reasoning Subtest of Undergraduate Data
Appendix F2. Item Difficulty Order for Reasoning Subtest of Undergraduate Data
Appendix F3. Targeting and Reliability for Reasoning Subtest of Undergraduate Data
Appendix F4. Item Fit Statistics for Reasoning Subtest of Undergraduate Data
Appendix F5. Local Independence in Reasoning Subtest of Undergraduate Data
Appendix F6. Evidence of Guessing in Reasoning Subtest of Undergraduate Data
Appendix F7. Distractor Information in Reasoning Subtest of Undergraduate Data
Appendix F8. Results of DIF Analysis for Reasoning Subtest of Undergraduate Data
Appendix F9. Content of Problematic Items in Reasoning (Undergraduate) Subtest
Appendix G1. Correlations between Item Location from the Item Bank and from Postgraduate Analysis
Appendix G2. Correlations between Item Location from the Item Bank and from Undergraduate Analysis
Appendix G3. Identification of Unstable Items after Adjusting the Units in Postgraduate Data
Appendix G4. Identification of Unstable Items after Adjusting the Units in Undergraduate Data
Appendix G5. Correlations between Person Location from the Item Bank and from Postgraduate Analysis
Appendix G6. Correlations between Person Location from the Item Bank and from Undergraduate Analysis
Appendix H1. Relationship between the ISAT and GPA in Postgraduate Data
Appendix H2. The Results of Multiple Regression Analyses for Postgraduate Data
Appendix H3. Relationship between the ISAT and GPA in Undergraduate Data
Appendix H4. The Results of Multiple Regression Analyses for Undergraduate Data
Chapter 1 Introduction

Selection for entry to higher education is considered an important issue in many countries. There are at least three reasons for its importance: tertiary selection determines the quality of the graduates, it affects curricula and teaching methods in secondary schools, and it affects social equity and social cohesion within societies (Harman, 1994). Accordingly, ensuring that an admission test is reliable and that the inferences made from test scores are valid becomes crucial. To achieve this, the internal structure of the test and its relation to external criteria need to be examined. In particular, to ensure that the test meets important measurement criteria, an examination based on a model which has properties of fundamental measurement, namely the Rasch model, has advantages compared to other approaches.

Andrich (2004) argues that the distinction between the Rasch model and other measurement models, namely item response theory (IRT) models, is not only a distinction between model properties but also between statistical paradigms. The IRT models are used within the traditional statistical paradigm (Andrich, 2004). In the traditional paradigm, the function of a model is to account for the data. Thus, when the data do not fit the model, another model which explains or describes the data better is used. In contrast, in the Rasch paradigm a model serves as a frame of reference. When the data do not fit the Rasch model, the data need to be examined and an explanation of the misfit sought. Thus, the Rasch model serves as a prescriptive and diagnostic tool.
26 2 Chapter 1 Applying the Rasch model and its paradigm can help in developing better items to measure a construct more validly, reliably, and efficiently. This study evaluates the Indonesian Scholastic Aptitude Test (ISAT) internally, through the Rasch model and its paradigm, and externally through its predictive validity. In addition, the stability of the estimates of item difficulty relative to the item bank is also examined. The test, developed by the Center for Educational Assessment (CEA) in Jakarta, has been used as one of the admission tests for undergraduate and postgraduate levels in some public universities in Indonesia. However, although it has been analysed and an item bank developed based on the Rasch model, it has not been reviewed comprehensively using the Rasch model and its paradigm. The chapter starts with the context and background of this study. Selection for higher education and the development of the ISAT are discussed first. This is followed by a description of the study, its significance, and an outline of the structure of the dissertation. 1.1 Selection for Higher Education Studies Selection for higher education generally takes place because the number of applicants is greater than the available places. The greater the ratio of applicants to places the more competitive the selection. In Asian countries where the number of applicants is increasing rapidly (Harman, 1994), the competition is inevitably very high. Competition, however, does not occur only in developing countries but also in developed countries. In the United States (US), for example, in general the chance for applicants to enter university (four-year institution) is relatively high. At least threequarters of applicants are admitted to about 65 % of the institutions. Still, the competition in some prestigious colleges is very high (Zwick, 2004). In many of these
countries there is strong competition for particular professional studies, for example, Law and Medicine. Higher education institutions differ in how they select students. However, in general, variation in selection method originates from three sources, namely evidence of applicants' quality, either aptitude or achievement; reference of assessment, either criterion-based or norm-based assessment; and context of assessment, either secondary school-based assessment or national or external assessment (Fulton, 1992). The issue which attracts much attention is the choice between assessment of aptitude and assessment of achievement. Some argue that selection should be based on the assessment of achievement, not potential or aptitude; others consider the assessment of aptitude more relevant. Different countries apply different criteria for selection and these criteria are usually a function of a country's education context. In the US, both achievement and aptitude are used as admission criteria. Most US universities accept either a score on the SAT, developed by the College Board (New York), which measures reasoning, or a score on the ACT test, developed by American College Testing (Iowa), which measures achievement (Briggs, 2009). In other countries, such as the United Kingdom (UK) and Australia, the criterion of admission is student achievement in prescribed subjects (Andrich & Mercer, 1997). In Indonesia, where public (state) universities are generally preferred to private universities, selection for undergraduate studies into all public universities until 2001 was based on a centralized achievement test. The applicants for all public universities sit for the same admission test at the same time, generally over two days. The subjects that all applicants are tested on are Basic Mathematics, Indonesian, and English.
In addition, applicants for Natural Science programs sit for Natural Science subject tests, namely Biology, Chemistry, Physics, Science Mathematics and
Applied Natural Science. Those who apply for social science programs sit for social science subject tests including History, Geography, Economics, and Applied Social Science. To study Kinesiology and Arts, applicants are required to take additional tests. From 2002, the system for selection was changed as a consequence of Ministerial Decree 173/U/2001. The decree states that student selection, including criteria and procedures, is set by each university. Nevertheless, there is an agreement among public universities to continue to use the previous, centralized system and to use the same criteria. This selection system is called Selection for Admission of New Students (SPMB). However, SPMB is not the only scheme for recruiting students. The universities, especially the prestigious ones, also apply other schemes in addition to SPMB. These schemes may differ from each other in terms of the criteria and the selection procedures. The criteria may be outstanding performance in a national or international academic competition (for example, a Physics or Math Olympiad), outstanding academic performance nominated by the region, outstanding performance in school and in a scholastic aptitude test, outstanding performance in a school with a low socioeconomic background, and outstanding performance in sport and arts. It is clear, then, that from 2002, especially for some prestigious universities, the schemes for recruiting students for undergraduate studies can in general be classified into two groups. The first is SPMB (a centralized selection procedure with achievement tests as the selection tool). The second comprises schemes other than SPMB, in which the selection procedures and criteria vary. In 2008 the SPMB changed to SNMPTN (National Selection to Enter Public Universities). However, except for the name, the selection system, including the
selection tool, did not change. Only from 2009 has a scholastic aptitude test been added as an admission test to complement the achievement tests. Meanwhile, selection at postgraduate level has never been centralized. Each university sets and applies its own selection system. Although the procedures are different, the criteria are the same. For doctorate programs, three components are generally assessed, namely English, scholastic aptitude, and subject matter. The last component may be assessed from a research proposal, interview, written test or portfolio. For Masters programs, some fields of study use the same three components as at the doctorate level, while others use only English and scholastic aptitude.

1.2 The Indonesian Scholastic Aptitude Test (ISAT)

The Background

As indicated earlier, in the 1980s selection to enter public universities in Indonesia at undergraduate level was based only on performance on an admission test, which was an achievement test in some subjects. There had been concern about this selection system. The system was considered not to provide adequate information about an applicant's potential for further study, because it captures only an applicant's knowledge in certain subjects. Some argued that certain students may not perform well in the achievement test for some reason even though they may be capable of succeeding in university studies. For example, applicants from low social and economic backgrounds may not perform well, not because they are incapable of further study, but because they have been disadvantaged in their schooling. Although it is not always the case, there is a trend that students from high social and economic status backgrounds attend high quality schools and students from low social and economic status backgrounds attend lower quality
schools. Similarly, those who live in big cities (urban areas) tend to receive better educational services than those in small cities (rural areas). In remote areas, in particular, the learning process is hindered by limited resources which, in turn, lead to low levels of academic achievement. Also, many students, especially in big cities, attend test preparation courses before sitting for university entrance tests. Some test preparation institutions are well known for their success in helping students get a place in universities. It is suspected that some students get a place in a university due to the drilling process in the preparation program even though their academic ability is relatively low. In the late 1980s, as a response to these concerns, the CEA, formerly the Research and Development Center for the Examination System, organized a national seminar on student selection methods. One of the recommendations that followed from this seminar was to develop a scholastic aptitude test to be used as one of the selection instruments for higher education admission. It was thought that using a scholastic aptitude test to complement an achievement test would provide a better prediction of future success than an achievement test alone. Since then, the Indonesian Scholastic Aptitude Test (ISAT) has been developed by the CEA.

Description

The ISAT has been developed to measure individual scholastic aptitude or academic capability. This aptitude is considered a significant factor contributing to success in higher education studies at both undergraduate and postgraduate levels. Therefore, although the ISAT was originally intended for selection at undergraduate level, during its development it was considered that it would be useful for selection at the postgraduate level as well.
The test consists of three subtests, Verbal, Quantitative, and Reasoning, and uses multiple choice item formats with five alternatives. The Verbal subtest measures reasoning in a verbal context; the Quantitative subtest measures reasoning in a numerical context; the Reasoning subtest measures the ability to draw a conclusion from a hypothetical situation or condition. The details of the test including the sections in each subtest, the number of items, and the time allocated to complete the subtest are shown in Table 1.1.

Table 1.1. ISAT Specifications

Subtest | Section | Number of Items | Allocated Time
Verbal | Synonyms, Antonyms, Analogies, Reading Comprehension | items | 30 min
Quantitative | Number Sequence, Arithmetic & Algebra concepts, Geometry | items | 60 min
Reasoning | Logic, Diagrams, Analytical | items | 40 min
Total | | 112 items | 130 min

Test Development

As indicated previously, the ISAT has been developed over almost 20 years. In the first years of its development the focus was on the development of the test specifications, the result of which is shown in Table 1.1. In the latter years the focus has been on the development of an item bank. For this purpose each year the CEA organizes activities related to item development, including item writing, item review, item trial, and item analysis.
In the item trials, in which the respondents are normally high school students (year 12), each student does not take all three subtests. Only one set of a subtest (about items) is given to each group (class). It takes approximately minutes to complete the test. Some linking items across trial forms are included. Items are then analysed using classical test theory and Rasch measurement theory. Classical item analysis, which is undertaken before the Rasch analysis, is conducted to examine how well the items work from the perspective of classical test theory. The main statistic used is the item discrimination index. The Rasch analysis is conducted only for items which show a positive discrimination index for the correct answer (key). It may be argued that this step of first using classical test theory is not necessary when applying the Rasch model. However, the process which is currently used is described here. Items which show a negative discrimination index for the key are not included for further analysis. If it is found that these items can be revised, they are retained for retrial. Those items for which no explanation of the negative discrimination can be offered and which cannot be revised are dropped. In the Rasch analysis, items are examined in terms of fit to the Rasch model; the criterion is the item fit statistic. The steps in ISAT development are summarized in Figure 1.1.

Figure 1.1. ISAT development process
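The classical screening step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CEA's actual procedure: the text does not say which discrimination index is used, so the sketch assumes a point-biserial correlation between the keyed item score and the rest score (total score excluding the item). The function names and the toy responses are hypothetical.

```python
from statistics import mean, pstdev

def score_responses(responses, key):
    """Score 1 where the response matches the key, else 0 (missing counts as 0)."""
    return [[1 if r == k else 0 for r, k in zip(person, key)]
            for person in responses]

def item_rest_discrimination(scored, item):
    """Point-biserial correlation of one item with the rest score (assumed index)."""
    item_scores = [row[item] for row in scored]
    rest_scores = [sum(row) - row[item] for row in scored]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return 0.0  # correlation undefined; treated here as non-discriminating
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean([(x - m_item) * (y - m_rest)
                for x, y in zip(item_scores, rest_scores)])
    return cov / (sd_item * sd_rest)

def screen_items(scored):
    """Split item indices into those passed on to Rasch analysis and those flagged."""
    keep, flagged = [], []
    for i in range(len(scored[0])):
        (keep if item_rest_discrimination(scored, i) > 0 else flagged).append(i)
    return keep, flagged

# Toy data (invented for illustration): six examinees, four items, key A B C D.
key = ["A", "B", "C", "D"]
responses = [
    ["A", "B", "C", "A"],
    ["A", "B", "C", None],   # missing response on the last item
    ["A", "B", "D", "B"],
    ["A", "C", "D", "D"],
    ["B", "C", "A", "D"],
    ["C", "A", "B", "D"],
]
scored = score_responses(responses, key)
keep, flagged = screen_items(scored)
print("to Rasch analysis:", keep)
print("flagged for review or retrial:", flagged)
```

In this toy data the last item is answered correctly mainly by the examinees with the lowest rest scores, so its discrimination is negative and it is flagged. Whether a flagged item is revised for retrial or dropped remains a judgement for the item reviewers, as the text describes.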
Test Administration, Scoring and Reported Results

To administer the test, testers need to attend a coaching session and to follow the instruction manual. Normally, it takes about 15 minutes for testing preparation, including reading test instructions and filling in the identity details on a computer answer sheet. The testing time is 130 minutes, with the time allocated to each subtest as described in Table 1.1. The examinees are informed that the ISAT scoring does not apply a penalty for incorrect responses. Each correct answer is scored 1 and each incorrect response is scored 0. A missing response is also scored 0. It is apparent that this scoring system encourages examinees to guess and thus, theoretically, the ISAT data may contain guessed responses. Four scores are reported: Verbal, Quantitative, Reasoning, and the Total. In each subtest a person's proficiency estimate in logits is converted to a scale with a mean of 300 and a standard deviation of 40. In this way a score in each subtest ranges approximately between 100 and 500. A total score is obtained by summing the scaled scores on the three subtests. The range of total scores is and is scaled to have a mean of 900 and a standard deviation of

Test Usage

Although the test has been developed over about 20 years, it has not been used widely until recently. From the early 1990s until the early 2000s, it was used by only one private university as one of its selection instruments. Only since 2004 has the test been employed by some public universities in Indonesia, notably the prestigious ones, as part of their selection tools. The ISAT has been used as a selection instrument for undergraduate and postgraduate levels in some fields of study in different ways. Some universities use the ISAT along
with other instruments, such as an achievement test and/or interview, while others may use the ISAT as the only selection tool. The role of the ISAT in the selection process also varies. Some give more weight to the ISAT score than to other scores, and some do not. Some use the ISAT scores for filtering applicants; some use the ISAT scores and other results simultaneously. When the ISAT is used for filtering, generally the cut-off score is 900 or above for more selective programs. For security and for aligning students to the difficulties of the items, different item sets are used for different groups of examinees. In terms of security, for example, the same item set would not be administered as an admission test in two different universities where there is a possibility that the examinees could sit both tests. In terms of aligning students to the difficulties of the items, a more difficult set would be given to higher proficiency examinees. However, because the examinees' proficiency level in scholastic aptitude is usually not known, it is inferred from the competitive level of the selection system. It is assumed that in more competitive selection systems the proficiency of the examinees is higher than in less competitive systems. Accordingly, a more difficult test is given to examinees in more selective procedures. It should be noted that the scholastic aptitude test which has been used in SNMPTN (National Selection to Enter Public Universities) since 2009 is not the ISAT which has been developed by the CEA. The SNMPTN test was prepared by the SNMPTN Committee.

1.3 Present Study

As stated earlier the ISAT has been used for 20 years. However, until now, no study has been conducted to examine this test comprehensively, especially based on the Rasch
paradigm of measurement. It is considered critical to examine thoroughly an instrument that serves as such a high stakes test. In addition, because the items used in this study were obtained from an item bank, it is necessary to examine the stability of the item parameters of the test with respect to their item bank values. Although in practice it is assumed that item parameters are invariant, they may change over time or across different groups. Another area examined is the predictive validity of the test. The ISAT, as described earlier, is used as a selection tool to enter higher education studies. Therefore, the extent to which the test predicts academic performance in higher education studies needs to be studied. This can be considered as an effort to build a sound validity argument to support the intended use of the test according to the Standards for Educational and Psychological Testing set by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (AERA, APA, & NCME, 1999). Therefore, this study examines the validity of the test by examining its internal structure based on the Rasch model and the Rasch paradigm, the stability of the item bank parameters, and its predictive validity. For the predictive validity purpose, responses of the examinees on the ISAT and their academic performance in universities are needed. Although the ISAT response data can be obtained from the CEA, academic performance data are available only from the universities. Thus, the predictive validity of the ISAT can be studied only with the cooperation of universities. To provide comprehensive results, it is desirable that data are obtained from examinees from as many fields of study as possible, both at undergraduate and postgraduate levels, and with evidence of their academic performance in universities. Therefore, the
universities chosen were those that used the ISAT to select students for various programs of study, had academic performance records for at least one year, and were willing to supply such data. Two years before this study started, that is in 2005, two universities, which will be referred to as A and B, used the ISAT to select students for undergraduate studies in almost all fields of study. In the same year, a third university, C, used the ISAT to select students for postgraduate studies in all fields of study. However, only university A and university C were able to provide data for the academic performance of those who were tested in Although university C was able to provide data on students' academic performance for postgraduate studies from all fields of study, university A, which had undergraduate data, provided students' academic performance from only two fields of study, namely Economics and Engineering. In 2005 university A used the ISAT to select students for undergraduate studies in a special scheme (not SNMPTN). In this scheme students who were in the top ten in their class in their third year of high school could apply to take the test, and the ISAT was the only test administered to the applicants. In contrast, in selection for postgraduate studies by university C, the ISAT was not the only admission test. Tests in specific areas were also used. As indicated earlier, in a selection situation, the number of applicants is generally greater than the number who are admitted. In this study, although the number of applicants is known, it is not clear how many applicants were actually admitted. Also not known were the cut score of the ISAT, the role of the ISAT in the admission decisions, and whether there were criteria or considerations in the admission decisions other than the admission test results.