IELTS test performance 2009

 

Each year, multiple versions of each of the six IELTS components (Listening, Academic Reading, General Training Reading, Academic Writing, General Training Writing, and Speaking) are released for use by centres testing IELTS candidates. Reliability estimates for the objectively and subjectively scored modules used in 2009 are reported below.


Reliability of objectively-scored components (Reading and Listening)

The reliability of the Listening and Reading tests is reported using Cronbach's alpha, an estimate of the internal consistency of each 40-item test. The following Listening and Reading versions released in 2009 had enough candidate responses to produce meaningful reliability estimates:

 

Listening (all Academic and General Training versions)
Version    Alpha
423        0.92
424        0.91
425        0.92
426        0.92
427        0.91
428        0.90
429        0.92
430        0.91
431        0.91
432        0.90
433        0.91
434        0.92
435        0.91
436        0.91
437        0.91
438        0.92
Average    0.91

 

 

General Training Reading
Version    Alpha
327        0.91
328        0.89
329        0.88
330        0.91
331        0.90
332        0.91
333        0.92
334        0.90
335        0.87
336        0.87
337        0.89
338        0.90
339        0.92
340        0.87
341        0.92
342        0.91
Average    0.90

 

 

Academic Reading
Version    Alpha
423        0.91
424        0.87
425        0.92
426        0.90
427        0.89
428        0.94
429        0.90
430        0.89
431        0.88
432        0.91
433        0.90
434        0.88
435        0.90
436        0.90
437        0.91
438        0.92
Average    0.90
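The alpha values in the tables above are Cronbach's alpha coefficients. For a $k$-item test, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_X^2}\right)$, where $s_i^2$ is the variance of item $i$ and $s_X^2$ is the variance of candidates' total scores. The sketch below is a minimal illustration of that standard formula, assuming a complete candidates-by-items matrix of item scores (for example 1 = correct, 0 = wrong). The function name `cronbach_alpha` and the toy data are illustrative assumptions, not part of the IELTS analysis.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a candidates-by-items matrix of item scores.

    scores: one row per candidate, one column per item (e.g. 1 = correct, 0 = wrong).
    Population variances are used throughout; the n/(n-1) correction would
    cancel in the ratio, so sample variances give the same alpha.
    """
    k = len(scores[0])  # number of items (40 on an IELTS Listening or Reading paper)
    # Variance of each item's scores across candidates
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of candidates' total raw scores
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data: 4 candidates x 3 items (made up, not IELTS data).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(cronbach_alpha(toy))  # 0.75
```

Alpha rises as items covary more strongly relative to their individual variances, which is why it is read as a measure of internal consistency.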

 

The alphas reported for the Listening and Reading components are at the levels of reliability expected for 40-item tests. From these reliability figures, the standard error of measurement (SEM) of each component can be estimated using the following formula:

 

$$\mathrm{SEM} = S_t \sqrt{1 - r_{xx'}}$$

where $S_t$ is the standard deviation of the test and $r_{xx'}$ is the reliability of the test.
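As a worked check of the formula, the sketch below averages the version-level alphas transcribed from the 2009 tables, records the minimum alpha per component, and applies the SEM formula using the standard deviations from Table 1 below. The page does not say which reliability value was used to produce Table 1; plugging in the average alpha is an assumption. Rounded to two decimals, the means match the published averages (0.91, 0.90, 0.90), and the resulting SEMs agree with Table 1 to within about 0.01.

```python
import math
from statistics import mean

# Version-level Cronbach's alphas transcribed from the three 2009 tables.
ALPHAS = {
    "Listening": [0.92, 0.91, 0.92, 0.92, 0.91, 0.90, 0.92, 0.91,
                  0.91, 0.90, 0.91, 0.92, 0.91, 0.91, 0.91, 0.92],
    "General Training Reading": [0.91, 0.89, 0.88, 0.91, 0.90, 0.91, 0.92, 0.90,
                                 0.87, 0.87, 0.89, 0.90, 0.92, 0.87, 0.92, 0.91],
    "Academic Reading": [0.91, 0.87, 0.92, 0.90, 0.89, 0.94, 0.90, 0.89,
                         0.88, 0.91, 0.90, 0.88, 0.90, 0.90, 0.91, 0.92],
}
# Standard deviations reported in Table 1.
SDS = {"Listening": 1.35, "General Training Reading": 1.42, "Academic Reading": 1.28}


def sem(sd, reliability):
    """Standard error of measurement: S_t * sqrt(1 - r_xx')."""
    return sd * math.sqrt(1 - reliability)


def summarize():
    """Mean and minimum alpha per component, plus the SEM implied by the mean alpha."""
    out = {}
    for name, alphas in ALPHAS.items():
        avg = mean(alphas)
        out[name] = {"mean": avg, "min": min(alphas), "sem": sem(SDS[name], avg)}
    return out


for name, s in summarize().items():
    print(f"{name}: mean alpha {s['mean']:.4f}, min {s['min']:.2f}, SEM {s['sem']:.2f}")
```

The minimum alphas (0.90, 0.87, 0.87) are the values the composite reliability estimate below treats as its cautious case.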

 

 

Table 1 Mean, standard deviation (SD) and standard error of measurement (SEM) of Listening and Reading

Component                   Mean    SD      SEM
Listening                   5.99    1.35    0.40
General Training Reading    5.70    1.42    0.45
Academic Reading            5.81    1.28    0.41

 

The SEM should be interpreted on the scale of the final band scores reported for Listening and Reading, which are reported in half-band steps.
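One way to read the SEM on the band scale is to express it in half-band steps and place an interval around an observed band. The sketch below assumes the usual classical test theory convention of normally distributed measurement error (about 68% of true scores within 1 SEM, about 95% within 1.96 SEM), which this page does not itself state. The observed band of 6.0 and the function name `sem_interval` are hypothetical.

```python
HALF_BAND = 0.5  # band scores are reported in half-band steps


def sem_interval(observed_band, sem, z=1.0):
    """Return (low, high) = observed_band -/+ z * SEM, clipped to the 0-9 band scale.

    z = 1.0 gives roughly a 68% interval and z = 1.96 roughly a 95% interval,
    assuming normally distributed measurement error (an assumption).
    """
    low = max(0.0, observed_band - z * sem)
    high = min(9.0, observed_band + z * sem)
    return low, high


listening_sem = 0.40  # Listening SEM from Table 1
print(f"Listening SEM = {listening_sem / HALF_BAND:.1f} half-band steps")
print("about 68% interval for an observed 6.0:", sem_interval(6.0, listening_sem))
print("about 95% interval for an observed 6.0:", sem_interval(6.0, listening_sem, z=1.96))
```

An SEM of 0.40 is a little less than one half-band step, so a single reported half-band difference between two candidates lies within measurement error.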

 

Reliability of subjectively-scored components (Writing and Speaking)

The reliability of the Writing and Speaking components cannot be reported in the same manner as for Reading/Listening because they are not item-based; candidates' writing and speaking performances are rated by trained and standardised examiners according to detailed descriptive criteria and rating scales.

 

Public versions of the band descriptors used to assess Writing Task 1, Writing Task 2 and Speaking performance are also available.

 

Benchmarked sample Writing performances and recorded (CD-based) Speaking performances at different levels, together with examiner comments, are included in the IELTS Official Practice Materials. In addition, the DVD IELTS Scores Explained is designed for organisations that need a detailed description of IELTS scores, to help them set appropriate English proficiency requirements.

 

Reliability of rating is assured through face-to-face training and certification of examiners, all of whom must undergo retraining and recertification every two years. A Professional Support Network (PSN) manages and standardises the examiner cadre, using both face-to-face monitoring of examiners and distance monitoring based on recordings of Speaking tests.

A ‘jagged profile’ system provides a further check on the overall reliability of IELTS performance assessment. Routine targeted double marking identifies the degree of divergence (the ‘jagged profile’) between a candidate's Writing and/or Speaking scores and their Reading and Listening scores, which allows possibly misclassified candidates to be identified. The jagged profile system is combined with targeted sample monitoring to identify further possible faulty ratings by examiners: selected centres worldwide must provide a sample of examiners' marked tapes and scripts, which are then second-marked by a team of IELTS Principal Examiners and Assistant Principal Examiners. Principal Examiners monitor the quality of both test conduct and rating, and feedback is returned to each test centre. The outcomes of these reliability measures feed into examiner retraining and contribute to the continual development of IELTS quality management and assurance systems.

 

Experimental generalisability studies were carried out as part of the IELTS Speaking and Writing Revision Projects to investigate the reliability of ratings (Shaw, 2004; Taylor & Jones, 2001). More recent G-studies based on examiner certification data showed coefficients of 0.83-0.86 for Speaking and 0.81-0.89 for Writing.
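A G-coefficient summarises how dependably candidates are ranked once rater effects are taken into account. For the simplest fully crossed design, in which every candidate is rated by every examiner (persons x raters), the relative G-coefficient for a decision based on $n_r$ raters is $\sigma^2_p / (\sigma^2_p + \sigma^2_{pr,e}/n_r)$; setting $n_r = 1$ gives the single-rater condition. The sketch below estimates the variance components from two-way ANOVA mean squares. The design, the choice of the relative (rather than absolute) coefficient, the function name and the toy ratings are all assumptions for illustration; the IELTS G-study designs are not described on this page.

```python
def g_coefficient(ratings, n_raters=1):
    """Relative G-coefficient for a fully crossed persons x raters (p x r) design.

    ratings: one row per candidate, one score per rater (at least 2 of each).
    n_raters: number of raters assumed in the decision study; 1 = single rater.
    """
    n_p, n_r = len(ratings), len(ratings[0])
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least 2 candidates and 2 raters")

    grand = sum(map(sum, ratings)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n_p for j in range(n_r)]

    # Sums of squares for a two-way ANOVA without replication
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_tot = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_pr = ss_tot - ss_p - ss_r  # interaction, confounded with residual error

    ms_p = ss_p / (n_p - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Variance components (a negative estimate is conventionally set to zero)
    var_p = max(0.0, (ms_p - ms_pr) / n_r)
    var_pr_e = ms_pr
    return var_p / (var_p + var_pr_e / n_raters)


# Toy data: 3 candidates, each rated by the same 2 examiners (made up, not IELTS data).
toy = [[4, 5], [6, 6], [8, 7]]
print(g_coefficient(toy))              # single-rater condition
print(g_coefficient(toy, n_raters=2))  # mean of two ratings
```

Averaging over more raters shrinks the error term, so the single-rater coefficient is the most cautious of the values a given G-study can yield.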

 

IELTS has four components, on which the overall band score is based, so an estimate of composite reliability is a useful measure of overall test reliability. Following Feldt & Brennan (1989), composite reliability estimates were calculated from 2009 test data. To produce an appropriately cautious estimate, minimum alpha values were used for the objectively marked papers, and G-coefficients for the single-rater condition were used for the subjectively marked scores. The composite reliability estimate for the Academic module was 0.95, with a composite SEM of 0.22. For General Training, the composite reliability was 0.96, with a composite SEM of 0.24. If average rather than minimum alphas are used for the objectively marked papers, composite reliability is 0.96 for both modules.
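The sketch below shows one standard way to combine component reliabilities into a composite: with component weights $w_i$, standard deviations $s_i$, reliabilities $r_i$ and inter-component correlations $\rho_{ij}$, and assuming errors are uncorrelated across components, $r_C = 1 - \sum_i w_i^2 s_i^2 (1 - r_i) / \sum_i \sum_j w_i w_j s_i s_j \rho_{ij}$ and the composite SEM is $\sqrt{\sum_i w_i^2 s_i^2 (1 - r_i)}$. Whether this is the exact variant used for IELTS is not stated. Equal weights of 1/4 reflect the overall band being the mean of the four component bands. The Listening and Academic Reading inputs come from this page (Table 1 SDs, minimum alphas); the Writing and Speaking reliabilities are simply the lower ends of the G-coefficient ranges quoted above. The Writing and Speaking SDs and all correlations are HYPOTHETICAL, so any resemblance of the output to the published figures is coincidental.

```python
import math


def composite_reliability(sds, reliabilities, corr, weights):
    """Reliability and SEM of a weighted composite, assuming uncorrelated errors.

    sds: observed standard deviation of each component
    reliabilities: reliability estimate of each component (alpha or G-coefficient)
    corr: observed inter-component correlation matrix (1.0 on the diagonal)
    weights: weight of each component in the composite
    """
    k = len(sds)
    scaled = [w * s for w, s in zip(weights, sds)]  # SD of each weighted component
    # Observed composite variance: sum_i sum_j w_i w_j s_i s_j rho_ij
    var_c = sum(scaled[i] * scaled[j] * corr[i][j] for i in range(k) for j in range(k))
    # Composite error variance: sum of weighted component error variances
    err_c = sum(a ** 2 * (1 - r) for a, r in zip(scaled, reliabilities))
    return 1 - err_c / var_c, math.sqrt(err_c)


# Component order: Listening, Academic Reading, Writing, Speaking.
weights = [0.25] * 4               # overall band = mean of the four component bands
sds = [1.35, 1.28, 1.00, 1.10]     # Writing and Speaking SDs are HYPOTHETICAL
rels = [0.90, 0.87, 0.81, 0.83]    # minimum alphas; lower ends of the G ranges
rho = 0.7                          # HYPOTHETICAL common inter-component correlation
corr = [[1.0 if i == j else rho for j in range(4)] for i in range(4)]

rel_c, sem_c = composite_reliability(sds, rels, corr, weights)
print(f"composite reliability {rel_c:.3f}, composite SEM {sem_c:.3f}")
```

The composite is more reliable than any single component because measurement errors, assumed independent, partly cancel when the four scores are averaged, while true-score differences, which are correlated across components, do not.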

 

References
Feldt, LS & Brennan, RL (1989) Reliability. In RL Linn (Ed), Educational measurement (3rd ed, pp 105-146). New York: Macmillan.
Shaw, SD (2004) IELTS writing: revising assessment criteria and scales (Phase 3). Research Notes 16, 3-7.
Taylor, L & Jones, N (2001) Revising the IELTS Speaking Test. Research Notes 4, 9-12.
