Test performance 2007
Each year, multiple versions of each of the six IELTS modules
(Listening, Academic Reading, General Training Reading, Academic
Writing, General Training Writing, and Speaking) are released for
use by centres testing IELTS candidates. Reliability estimates for
the objectively and subjectively scored modules used in 2007 are
reported here.
Reliability of objectively-scored modules (Reading and
Listening)
The reliability of Listening and Reading tests is reported using
Cronbach's alpha, a reliability estimate which measures the
internal consistency of the 40-item test. The following Listening
and Reading material released in 2007 had sufficient candidate
responses to estimate and report meaningful reliability values as
follows:
| Modules (all Academic and General Training versions) | Alpha by version |
| --- | --- |
| Listening version 310 | 0.89 |
| Listening version 311 | 0.89 |
| Listening version 312 | 0.88 |
| Listening version 313 | 0.91 |
| Listening version 314 | 0.87 |
| Listening version 315 | 0.89 |
| Listening version 316 | 0.88 |
| Listening version 317 | 0.88 |
| Listening version 318 | 0.89 |
| Listening version 319 | 0.90 |
| Listening version 320 | 0.89 |
| Listening version 321 | 0.89 |
| Listening version 322 | 0.90 |
| Listening version 323 | 0.90 |
| Listening version 324 | 0.88 |
| Listening version 325 | 0.87 |
| Average alpha across versions | 0.89 |

| Modules | Alpha by version |
| --- | --- |
| Academic Reading version 310 | 0.85 |
| Academic Reading version 311 | 0.85 |
| Academic Reading version 312 | 0.87 |
| Academic Reading version 313 | 0.87 |
| Academic Reading version 314 | 0.84 |
| Academic Reading version 315 | 0.85 |
| Academic Reading version 316 | 0.88 |
| Academic Reading version 317 | 0.85 |
| Academic Reading version 318 | 0.83 |
| Academic Reading version 319 | 0.90 |
| Academic Reading version 320 | 0.87 |
| Academic Reading version 321 | 0.88 |
| Academic Reading version 322 | 0.89 |
| Academic Reading version 323 | 0.84 |
| Academic Reading version 324 | 0.84 |
| Academic Reading version 325 | 0.90 |
| Average alpha across versions | 0.86 |

| Modules | Alpha by version |
| --- | --- |
| General Training version 280 | 0.91 |
| General Training version 281 | 0.91 |
| General Training version 282 | 0.88 |
| General Training version 283 | 0.90 |
| General Training version 284 | 0.90 |
| General Training version 285 | 0.85 |
| General Training version 286 | 0.88 |
| General Training version 287 | 0.90 |
| General Training version 288 | 0.88 |
| General Training version 289 | 0.86 |
| General Training version 290 | 0.89 |
| General Training version 291 | 0.89 |
| General Training version 292 | 0.88 |
| General Training version 293 | 0.89 |
| General Training version 294 | 0.91 |
| Average alpha across versions | 0.89 |

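
For readers unfamiliar with the statistic tabulated above, the following is a minimal sketch of how Cronbach's alpha can be computed from a candidates-by-items score matrix, using the usual definition alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The function name `cronbach_alpha` and the toy responses are illustrative assumptions; they are not IELTS data and not the procedure actually used to produce the figures in this report.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of candidate rows, one score per item."""
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across candidates.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the candidates' total scores.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 5 candidates, 4 right/wrong items (not IELTS data).
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(toy_responses)
print(f"alpha = {alpha:.2f}")
```

A real 40-item module would simply use a wider matrix; the calculation itself does not change.
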
The figures reported for Listening and Reading modules indicate
the expected levels of reliability for tests containing 40 items.
On the basis of these reliability figures, an estimate of the
standard error of measurement (SEM) may be calculated for these
modules using the following formula:

$$\mathrm{SEM} = s_t \sqrt{1 - r_{xx'}}$$

where $s_t$ is the standard deviation of the test and $r_{xx'}$ is the reliability of the test.

Table 1 Mean, standard deviation and standard error of measurement of Listening and Reading

| Module | Mean | Standard deviation | SEM |
| --- | --- | --- | --- |
| Listening | 6.06 | 1.21 | 0.40 |
| Academic Reading | 5.98 | 1.06 | 0.39 |
| General Training Reading | 5.59 | 1.26 | 0.41 |

The SEM should be interpreted in terms of the final band scores
reported for Listening and Reading modules (which are reported in
half-bands).
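
As a rough check on Table 1, the sketch below recomputes the SEM from the reported standard deviations, assuming the standard classical test theory formula SEM = s_t × sqrt(1 − r_xx') and taking the rounded average alpha of each module as its reliability. That choice of reliability value is an assumption made for illustration; the report does not say exactly which inputs were used, so small differences from the tabulated values are to be expected.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# (standard deviation from Table 1, average alpha across versions)
modules = {
    "Listening": (1.21, 0.89),
    "Academic Reading": (1.06, 0.86),
    "General Training Reading": (1.26, 0.89),
}

for name, (sd, alpha) in modules.items():
    print(f"{name}: SEM is approximately {sem(sd, alpha):.3f}")
```

With these rounded inputs the results land within about 0.01 of the SEM values in Table 1.
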
Reliability of subjectively-scored modules (Writing and
Speaking)
The reliability of the Writing and Speaking modules cannot be
reported in the same manner as for Reading/Listening because they
are not item-based; candidates' writing and speaking performances
are rated by trained and standardised examiners according to
detailed descriptive criteria and rating scales. The assessment
criteria used for rating Writing and Speaking performance are
described in the IELTS 2006 Handbook. Benchmarked example writing
performances and CD-based speaking performances at different levels
can be found, along with examiner comments, in the IELTS official
practice materials which can be ordered from the IELTS website.
User-oriented band descriptors describing levels of Writing and
Speaking performance are also available on the website. In
addition, a new DVD “IELTS Scores Explained” provides information
specifically tailored to organizations wanting a detailed
description of IELTS scores. This information helps in setting
appropriate standards of English proficiency.
Reliability of rating is assured through the face-to-face
training and certification of examiners, all of whom must undergo
retraining and recertification every two years. A
Professional Support Network (PSN) manages and standardizes the
examiner cadre, including face-to-face examiner monitoring as well
as distance monitoring (using recordings of the Speaking tests). A
‘jagged profile’ system maintains a further check on the global
reliability of IELTS performance assessment. Routine targeted
double marking identifies the level of divergence (i.e., jagged
profile) between Writing and/or Speaking scores and Reading and
Listening scores. This process allows for the identification of
possible misclassified candidates. The jagged profile system is
also combined with ‘Targeted sample monitoring’ to further identify
possible faulty ratings by examiners. Selected centres worldwide
are required to provide a sample of examiners' marked tapes and
scripts. Tapes and scripts are then second-marked by a team of
IELTS Principal Examiners and assistant Principal Examiners.
Principal Examiners monitor for quality of both test conduct and
rating, and feedback is returned to each test centre. The outcomes
that emerge from these reliability measures feed back into examiner
retraining and continually build on quality management and
assurance systems for IELTS.
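
The idea behind a jagged-profile check can be sketched as follows. This is only an illustration: the report does not state how divergence is measured or what level triggers double marking, so the comparison against the mean of the Listening and Reading bands and the 2.0-band threshold are hypothetical choices, as is the function name `is_jagged`.

```python
def is_jagged(listening, reading, writing, speaking, threshold=2.0):
    """Flag a profile whose Writing or Speaking band diverges from the
    objectively scored modules by at least `threshold` bands.

    The threshold and the use of the Listening/Reading mean are
    hypothetical; the operational rule is not given in the report.
    """
    objective_mean = (listening + reading) / 2
    return any(
        abs(score - objective_mean) >= threshold
        for score in (writing, speaking)
    )


# Hypothetical candidates: (Listening, Reading, Writing, Speaking)
print(is_jagged(7.0, 7.0, 4.5, 6.5))  # Writing is 2.5 bands below
print(is_jagged(6.0, 6.5, 6.0, 6.5))  # all bands close together
```

Profiles flagged in this way would then be candidates for second marking.
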
Experimental generalisability studies were also carried out as
part of the IELTS Speaking Revision Project (1998-2001) and the
IELTS Writing Revision Project (2001-2005). The study conducted for
the Speaking Revision produced an inter-rater correlation of 0.77,
and a g-coefficient of 0.86 for the operational single-rater
condition (see article in Research Notes 4); the Writing Revision
study produced an inter-rater correlation of 0.77 and
g-coefficients of 0.85-0.93 for the operational single-rater
condition (see Research Notes 16: IELTS writing: Revising
assessment criteria and scales, Phase 3).
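
An inter-rater correlation of the kind quoted above is commonly a Pearson correlation between the bands two examiners award to the same performances. The sketch below assumes that interpretation; the ratings are invented for illustration and are not data from the revision studies, and the g-coefficients reported alongside come from a separate generalisability analysis that is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical bands awarded by two raters to the same six performances.
rater_a = [5.0, 6.0, 6.5, 7.0, 5.5, 8.0]
rater_b = [5.5, 6.0, 7.0, 6.5, 6.0, 7.5]
print(f"inter-rater correlation: {pearson(rater_a, rater_b):.2f}")
```
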
From 2008 it is expected that Speaking tests will be digitally
recorded by IELTS centres worldwide. Cambridge ESOL has been
undertaking research into the use of digital audio technology in
speaking assessment for several years, including the feasibility of
such technology for double marking of speaking tests. A recent
study (2006) from the Digital Audio Project investigated partial
double-marking of IELTS Speaking tests in live conditions. Partial
rating presupposes that candidate performance in one or more parts
of the Speaking test correlates adequately with performance in the
Speaking test overall. The results indicated that Part 3 of the
test provided the best correlation between marks on the full test
and marks on a test part. Further empirical studies from the
Digital Audio Project are currently examining the potential for
partial double marking to provide a reliable indicator of fairness
and quality assurance of the IELTS Speaking test.
Performance of test materials in the Writing and Speaking
modules is routinely analysed to check on the comparability of
different test versions and to ensure any variation is within the
acceptable limit. Mean band scores for the Academic Writing versions
released in 2006, and for which a sufficient sample size has been
obtained, ranged from 5.31 to 6.07. Mean band scores for the General
Training Writing versions released in 2006 ranged from 5.53 to
5.85. Mean band scores for Speaking versions released in 2006 ranged
from 5.55 to 6.30.
Reporting IELTS Composite Reliability
The IELTS exam contains four components upon which an overall
band score is awarded. Thus an estimate of composite reliability
offers a useful measure for overall test reliability. Approaches to
estimating the reliability of a composite test are discussed in
Feldt & Brennan (1989: 117)¹ and Crocker & Algina (1986:
119-121)². The method used here is taken from Feldt & Brennan
(1989).
Composite reliability estimates were calculated for the period
1 January to 20 December 2004. To generate an appropriately
cautious estimate, minimum alpha values were used for the
objectively marked papers, and g-coefficients for the single-rater
condition were used for the subjectively marked papers. The
composite reliability estimate for the Academic module was 0.95,
with a composite SEM of 0.21. This implies a 95% probability that a
candidate's true score falls within 0.41 of a band (less than half a
band) of the observed score. For General Training the composite
reliability was 0.95, with an SEM of 0.23. If average, rather than
minimum, values are used for the objective paper alphas, the
reliability for both Academic and GT versions improves slightly to
0.96.
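
The 0.41 figure quoted above is consistent with a two-sided 95% interval under a normal error model, that is, 1.96 × SEM. The sketch below assumes that model (the report does not state the multiplier explicitly) and applies it to both composite SEMs; the name `half_width_95` is illustrative.

```python
Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def half_width_95(sem):
    """Half-width of an approximate 95% band around an observed score."""
    return Z_95 * sem


academic = half_width_95(0.21)          # composite SEM, Academic
general_training = half_width_95(0.23)  # composite SEM, General Training

print(f"Academic: +/- {academic:.2f} bands")
print(f"General Training: +/- {general_training:.2f} bands")
```

Both half-widths come out below half a band, in line with the interpretation given for the Academic module.
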
¹ Feldt, L. S. & Brennan, R. L. (1989) Reliability. In Linn, R. L.
(Ed.), Educational Measurement, 3rd edition. American Council on
Education/Macmillan.
² Crocker, L. & Algina, J. (1986) Introduction to
classical and modern test theory. Orlando, FL: Harcourt Brace
Jovanovich.