
Test performance 2015

Each year, multiple versions of each of the six IELTS modules (Listening, Academic Reading, General Training Reading, Academic Writing, General Training Writing, and Speaking) are released for use by centres testing IELTS candidates. Reliability estimates for the modules used in 2015 are reported below. 

Reliability of Reading and Listening modules


The reliability of the Listening and Reading tests is reported using Cronbach's alpha, an estimate of the internal consistency of each 40-item test. The Listening and Reading versions released in 2015 that attracted sufficient candidate responses for a meaningful reliability estimate are reported below (a sketch of the alpha calculation follows the tables):

Listening (all Academic and General Training versions)

Version   Alpha
741       0.911
742       0.893
743       0.881
744       0.920
745       0.915
746       0.923
747       0.918
749       0.915
752       0.919
753       0.918
754       0.887
755       0.896
756       0.895
758       0.904
759       0.915
760       0.882
761       0.929

Average Alpha across versions: 0.91

General Training Reading

Version   Alpha
531       0.916
532       0.928
533       0.903
535       0.917
536       0.923
537       0.922
538       0.920
539       0.911
540       0.934
541       0.915
542       0.932
543       0.906
544       0.928
545       0.899
546       0.899
548       0.922
549       0.915
550       0.923
551       0.916

Average Alpha across versions: 0.92


Academic Reading

Version   Alpha
741       0.92
742       0.89
743       0.91
744       0.90
745       0.87
746       0.91
747       0.91
748       0.90
749       0.92
750       0.91
751       0.86
752       0.91
753       0.92
754       0.94
755       0.87
756       0.91
757       0.89
758       0.91
759       0.92
760       0.89
761       0.92

Average Alpha across versions: 0.90
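To make the alpha statistic concrete, here is a minimal sketch of how Cronbach's alpha is computed from a candidates-by-items matrix of dichotomous scores. This is an illustration only, not the operational IELTS analysis; the simulated responses and the simple logistic response model are assumptions for demonstration.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (candidates x items) score matrix.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = scores.shape[1]                          # number of items (40 for IELTS Listening/Reading)
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item across candidates
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of candidates' total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Illustrative use: 1,000 simulated candidates answering 40 dichotomous items.
rng = np.random.default_rng(0)
ability = rng.normal(size=(1000, 1))                # latent candidate ability
difficulty = rng.normal(size=(1, 40))               # item difficulty
prob = 1 / (1 + np.exp(-(ability - difficulty)))    # simple logistic response model
responses = (rng.random((1000, 40)) < prob).astype(float)

print(f"alpha = {cronbach_alpha(responses):.3f}")
```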

The figures reported for Listening and Reading modules indicate the expected levels of reliability for tests containing 40 items. On the basis of these reliability figures, an estimate of the standard error of measurement (SEM) may be calculated for these modules using the following formula:
SEM = s_t × √(1 − r_xx′)

where s_t is the standard deviation of the test and r_xx′ is the reliability of the test.

Table 1: Mean, standard deviation and standard error of measurement of Listening and Reading (2015)

Module                            Mean   SD    Alpha   SEM
Listening                         6.10   1.3   0.92    0.37
Academic Reading (ACR)            6.02   1.2   0.90    0.38
General Training Reading (GTR)    6.00   1.5   0.91    0.45
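As a quick check, the SEM values in Table 1 follow directly from the formula above. The sketch below reproduces them from the published SD and alpha figures:

```python
import math

# (SD, alpha) pairs taken from Table 1
modules = {
    "Listening": (1.3, 0.92),
    "ACR":       (1.2, 0.90),
    "GTR":       (1.5, 0.91),
}

for name, (sd, alpha) in modules.items():
    sem = sd * math.sqrt(1 - alpha)   # SEM = s_t * sqrt(1 - r_xx')
    print(f"{name}: SEM = {sem:.2f}")
# Listening: SEM = 0.37, ACR: SEM = 0.38, GTR: SEM = 0.45 (matches Table 1)
```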

The SEM should be interpreted relative to the final band scores for the Listening and Reading components, which are reported in half-bands.

The reliability of the Writing and Speaking modules cannot be reported in the same manner as for Reading and Listening because these modules are not item-based; test takers' writing and speaking performances are rated by trained and standardised examiners against detailed descriptive criteria and rating scales. The assessment criteria used for rating Writing and Speaking performance are described in the IELTS 2006 Handbook. Benchmarked example writing performances and CD-based speaking performances at different levels, together with examiner comments, can be found in the official IELTS practice materials, which can be ordered from the IELTS website. User-oriented band descriptors describing levels of Writing and Speaking performance are also available on the website. In addition, the "IELTS Scores Explained" DVD provides information tailored to organisations that want a detailed description of IELTS scores, which helps in setting appropriate standards of English proficiency.

Reliability of rating is assured through face-to-face training and certification of examiners; all examiners must undergo retraining and recertification every two years. A Professional Support Network (PSN) manages and standardises the examiner cadre, using both face-to-face monitoring and distance monitoring of recorded Speaking tests.

A 'jagged profile' system provides a further check on the global reliability of IELTS performance assessment. Routine targeted double-marking identifies divergence (a jagged profile) between Writing and/or Speaking scores and Reading and Listening scores, allowing possibly misclassified test takers to be identified. The jagged profile system is combined with targeted sample monitoring to identify possible faulty ratings by examiners: selected centres worldwide are required to provide a sample of examiners' marked tapes and scripts, which are then second-marked by a team of IELTS Principal Examiners and Assistant Principal Examiners. Principal Examiners monitor the quality of both test conduct and rating, and feedback is returned to each test centre. The outcomes of these reliability measures feed back into examiner retraining and continually strengthen the quality management and assurance systems for IELTS.
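For illustration only, the divergence check at the heart of the jagged profile system might be sketched as follows; the two-band threshold and the flagging rule are hypothetical assumptions, not the operational IELTS criteria.

```python
def is_jagged(listening: float, reading: float,
              writing: float, speaking: float,
              threshold: float = 2.0) -> bool:
    """Flag a result for second-marking when examiner-rated scores
    (Writing, Speaking) diverge from the objectively marked scores
    (Listening, Reading) by at least `threshold` bands.

    The threshold is a hypothetical value chosen for illustration.
    """
    objective = (listening + reading) / 2
    return any(abs(score - objective) >= threshold
               for score in (writing, speaking))

# A candidate with Listening 7.0 and Reading 7.5 but Writing 4.5 would be
# flagged for targeted double-marking under this illustrative rule.
print(is_jagged(7.0, 7.5, 4.5, 7.0))  # True
```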

Experimental generalisability studies were carried out as part of the IELTS Speaking and Writing Revision Projects to investigate the reliability of ratings (Shaw, 2004; Taylor & Jones, 2001). More recent G-studies based on examiner certification data showed coefficients of 0.83–0.86 for Speaking and 0.81–0.89 for Writing.

The IELTS exam contains four components upon which an overall band score is awarded, so an estimate of composite reliability offers a useful measure of overall test reliability. Following Feldt & Brennan (1989), composite reliability estimates were calculated from test data from 2009. To generate an appropriately cautious estimate, minimum alpha values were used for the objectively marked papers, and G-coefficients for the single-rater condition were used for the subjectively marked scores. The composite reliability estimate for both the Academic and General Training modules was 0.96, with a composite SEM of 0.23.
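For readers interested in the mechanics, Feldt & Brennan's approach treats the error variance of a weighted composite as the weighted sum of the component error variances (assuming uncorrelated errors). The sketch below illustrates that calculation; the weights, SDs, reliabilities and intercorrelations used are hypothetical placeholders, not the 2009 IELTS data.

```python
import numpy as np

def composite_reliability(w, sd, rel, corr):
    """Reliability of a composite C = sum_i w_i * X_i (Feldt & Brennan, 1989).

    rho_C = 1 - sum_i w_i^2 sd_i^2 (1 - rel_i) / var(C),
    where var(C) is computed from component SDs and intercorrelations.
    """
    w, sd, rel = map(np.asarray, (w, sd, rel))
    cov = np.outer(sd, sd) * np.asarray(corr)   # component covariance matrix
    var_c = w @ cov @ w                         # variance of the composite
    err = np.sum(w**2 * sd**2 * (1 - rel))      # error variance of the composite
    return 1 - err / var_c

# Hypothetical inputs: equal weights for Listening, Reading, Writing, Speaking,
# with illustrative SDs, reliabilities and intercorrelations (NOT the 2009 data).
w    = [0.25, 0.25, 0.25, 0.25]
sd   = [1.3, 1.2, 1.1, 1.0]
rel  = [0.92, 0.90, 0.85, 0.85]
corr = np.full((4, 4), 0.6) + 0.4 * np.eye(4)   # 1.0 diagonal, 0.6 off-diagonal

print(f"composite reliability = {composite_reliability(w, sd, rel, corr):.2f}")
```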

References

Feldt, LS & Brennan, RL (1989) Reliability. In RL Linn (Ed), Educational measurement (3rd ed, pp 105–146). New York: Macmillan
Shaw, SD (2004) IELTS writing: revising assessment criteria and scales (Phase 3). Research Notes 16, 3–7
Taylor, L & Jones, N (2001) Revising the IELTS Speaking Test. Research Notes 4, 9–12