Reliability and Accuracy | Avant Assessment

Overview

Accurate and reliable scores are essential in language proficiency testing. The Reading and Listening sections of STAMP are multiple-choice, which allows for automatic scoring. The Writing and Speaking sections allow open-ended responses, which require human raters.

Automatically Scored

Human Rated*

*This research used only human-scored responses, preceding Avant’s automated grading system.

For this reason, Avant is committed to making sure our rating accuracy and agreement between different raters are as high as possible.

A recent analysis of over 23,000 Writing and Speaking responses for five STAMP 4S languages (Arabic, Spanish, French, Simplified Chinese, and Russian) and three STAMP WS languages (Amharic, Haitian Creole, and Vietnamese) found that Avant’s raters demonstrated high scoring accuracy and inter-rater agreement, making the scores awarded in the Writing and Speaking sections of STAMP accurate and reliable for their intended purposes.

How Writing and Speaking Proficiency is Scored

The study examined the Writing and Speaking sections of STAMP, which are scored by trained raters using STAMP levels from 0 (No Proficiency) to 8 (Advanced-Mid).

Examinees respond to three real-world prompts, showcasing their skills. Certified raters independently score each response, backed by rigorous training and ongoing monitoring to ensure accuracy and consistency.

For 80% of responses, a single rater’s score is official. For the remaining 20%, at least two raters score the response, with a manager resolving disagreements. Raters work independently, ensuring unbiased results. The final Writing or Speaking score seen in our reports reflects the highest level consistently demonstrated across at least two of the three prompts.

The chart below illustrates this process:

Figure 1. System rules for arriving at an examinee’s final STAMP level for the Writing and Speaking sections

As shown in Figure 1, an examinee’s official STAMP level is determined by the highest level they can consistently demonstrate in at least two out of three responses. For example, if an examinee receives Novice-Mid for their first response, Novice-High for their second, and Novice-High for their third, their final STAMP level is STAMP 3 (Novice-High). Alternatively, if they receive Intermediate-Low for the first response, Novice-High for the second, and Intermediate-Mid for the third, their final level is Intermediate-Low, as it is the highest level they sustained twice (in the first and third responses).
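The two-out-of-three rule described above can be sketched in a few lines of Python. This is only an illustration, not Avant's scoring system: it assumes each response already has a numeric STAMP level from 0 to 8, and that a response rated at a given level also demonstrates every lower level, so the final level is the second-highest of the three ratings. The function name `final_stamp_level` is hypothetical.

```python
def final_stamp_level(levels):
    """Return the highest level reached in at least two of three responses.

    Assumes `levels` holds three integer STAMP levels (0-8). A response at
    level k is treated as also demonstrating every level below k, so the
    answer is the middle value of the three sorted ratings.
    """
    if len(levels) != 3:
        raise ValueError("expected exactly three response levels")
    return sorted(levels)[1]


# Novice-Mid (2), Novice-High (3), Novice-High (3)
print(final_stamp_level([2, 3, 3]))  # 3 (Novice-High)

# Intermediate-Low (4), Novice-High (3), Intermediate-Mid (5)
print(final_stamp_level([4, 3, 5]))  # 4 (Intermediate-Low)
```

Both calls reproduce the worked examples in the paragraph above.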

Using three independent prompts in both the Writing and Speaking sections of STAMP has two main benefits:

Broader Topic Coverage: Assessing examinees across different topics ensures that the awarded proficiency level is more likely to generalize to other real-world situations.
Minimizing Rater Bias: Coupled with the scoring method, using multiple prompts helps reduce potential rating bias from individual raters.

Next, we will discuss the definitions of reliability and accuracy.

Reliability vs Accuracy

Figure 2: The difference between reliability and accuracy. Ideally, tests should be both reliable and accurate, as this ensures the validity of the scores for their intended use and interpretation

Reliability

Reliability refers to the consistency of measurement (Bachman & Palmer, 1996). In simple terms, it is how much we can trust that the test scores will remain the same if an examinee takes the test again at different times or takes different versions of the test, assuming their proficiency has not changed.

For example, if an examinee scores Intermediate-Low today and Intermediate-High tomorrow, without any change in their knowledge or mental state, it suggests the test may not be highly reliable. Similarly, if an examinee scores Advanced-Low on one version of a test and Intermediate-Mid on another, it indicates a lack of consistency, pointing to an issue with the test’s reliability.

One factor contributing to a test’s reliability is how it is scored. In the STAMP test, the Reading and Listening sections are made up of multiple-choice questions that are scored automatically by a computer. This ensures that if an examinee provides the same answers on different occasions, they will always receive the same score.

The Writing and Speaking sections, by contrast, are scored by human raters. This means that scores can vary depending on who rates the response. With well-trained raters, however, we expect score variations to be minimal, reducing the impact of leniency, strictness, or potential bias.

Accuracy

Examinees expect their scores to reflect only their proficiency in the construct being measured (in STAMP, proficiency in each language domain).

Accuracy refers to how well the awarded score represents an examinee’s true ability. For example, if an examinee submits a Speaking response at the Intermediate-High level but receives an Intermediate-Low score from two raters, the score is inaccurate. Even if two other raters assign Intermediate-Low two months later, the score remains inaccurate, although it is reliable (since it is consistent across raters and over time).

Evaluating Rater Score Reliability and Accuracy

When responses are scored by human raters, as in the case of STAMP, it’s crucial to ensure that scores reflect the quality of the response itself, not the characteristics of the rater. In other words, scores should depend solely on the examinee’s demonstrated proficiency, not on rater leniency, strictness, or bias.

Language test providers often use statistics to show how much scores may vary based on the rater. Typically, this involves comparing ratings from two separate raters on the same response. Ideally, raters should agree as often as possible, which indicates a reliable scoring process.

However, reliability must also be accompanied by accuracy. Two raters may assign the same score, but both could be incorrect. In a well-developed test, the goal is for raters to consistently agree and be accurate in their scoring.

Perfect agreement between human raters is not always realistic. Despite training and expertise, even qualified raters may disagree at times—just like doctors, engineers, or scientists. The aim is to achieve high agreement that is defensible given the intended use of the scores.

Below are the statistical measures we use at Avant to evaluate the quality of ratings provided by our raters. While many companies report only exact and adjacent agreement, we assess additional measures to get a comprehensive view of rating quality. The measures reported in this paper include:

Exact Agreement:

This measure is reported as a percentage, showing how often Rater 1 and Rater 2 assigned exactly the same level to a response across the entire dataset. For example, if both raters assign a STAMP level 5 to the same response, it counts as an instance of exact agreement. According to Feldt and Brennan (1989), when two raters are used, exact agreement should be at least 80%, with 70% considered acceptable for operational use.

This same measure can also be used to compare the score assigned by Rater 1 to the official score a response receives after being rated by at least two raters. This is the case employed in the Overview Chart below.

Exact + Adjacent Agreement:

This measure is reported as a percentage showing how often Rater 1 and Rater 2 assigned either the same level or an adjacent level to a response across the entire dataset.

For example, STAMP level 5 is adjacent to level 4 and level 6. If Rater 1 assigns level 4 and Rater 2 assigns level 5, it counts towards this measure because the levels are adjacent. According to Graham et al. (2012), when a rating scale has more than 5-7 levels, as with the STAMP scale, the exact + adjacent agreement should be close to 90%.
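As a rough illustration of the two agreement measures above, the sketch below computes both from paired ratings. The function name `agreement_rates` and the sample ratings are made up for the example and are not Avant data.

```python
def agreement_rates(rater1, rater2):
    """Return (exact, exact_plus_adjacent) agreement as fractions of 1.

    `rater1` and `rater2` are equal-length lists of integer STAMP levels,
    where position i in both lists refers to the same response.
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty, equal-length rating lists")
    diffs = [abs(a - b) for a, b in zip(rater1, rater2)]
    n = len(diffs)
    exact = sum(d == 0 for d in diffs) / n
    exact_plus_adjacent = sum(d <= 1 for d in diffs) / n
    return exact, exact_plus_adjacent


# Hypothetical ratings for five responses.
r1 = [5, 4, 3, 6, 2]
r2 = [5, 5, 3, 4, 2]
exact, exact_adj = agreement_rates(r1, r2)
print(f"exact = {exact:.0%}, exact + adjacent = {exact_adj:.0%}")
```

Here three of the five pairs match exactly and one more is adjacent (4 vs. 5), while the 6 vs. 4 pair counts toward neither measure.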

Quadratic weighted kappa (QWK)

Cohen’s kappa (𝜅) measures reliability between two raters while accounting for the possibility of agreement by chance. For example, with the 9-point STAMP scale (from level 0 to level 8), there is an 11.11% chance that two raters would agree on a score purely by chance. At Avant, we also use quadratic weights when calculating kappa, meaning higher penalties are given to larger discrepancies between scores. For instance, a difference between STAMP level 3 and level 7 is more problematic than a difference between level 3 and level 4.

Williamson et al. (2012) recommend that quadratically weighted kappa (QWK) should be ≥ 0.70, while Fleiss et al. (2003) note that values above 0.75 indicate excellent agreement beyond chance. A QWK value of 0 means agreement is no better than chance, whereas a value of 1 indicates perfect agreement.
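For readers who want to see how the quadratic weighting works, here is a minimal pure-Python sketch of the standard QWK formula: one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement. It is not Avant's implementation; the function name and the default of nine levels (STAMP 0 to 8) are assumptions for the example.

```python
def quadratic_weighted_kappa(rater1, rater2, num_levels=9):
    """Quadratic weighted kappa for two lists of integer levels 0..num_levels-1."""
    n = len(rater1)
    if n == 0 or n != len(rater2):
        raise ValueError("need two non-empty, equal-length rating lists")

    # observed[i][j] counts responses rated i by Rater 1 and j by Rater 2.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater1, rater2):
        observed[a][b] += 1

    # Marginal totals give the agreement expected by chance.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(num_levels))
                  for j in range(num_levels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # The penalty grows with the square of the distance between levels.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when every rating is the same level")
    return 1.0 - weighted_observed / weighted_expected
```

With these weights, a 3 vs. 7 disagreement is penalized sixteen times as heavily as a 3 vs. 4 disagreement, which matches the intuition described above.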

Standardized Mean Difference (SMD)

This measure shows how similarly two raters use a rating scale. It compares the difference in the mean of two sets of scores (Rater 1 vs. Rater 2), standardized by the pooled standard deviation of those scores. Ideally, neither rater should favor or avoid certain levels on the scale (e.g., avoiding STAMP 0 or STAMP 8). In other words, both raters should use the full range of the scale (STAMP 0 – STAMP 8), with scores reflecting the proficiency demonstrated in the response. The recommended value for this measure is ≤ 0.15 (Williamson et al., 2012), indicating that the distributions of both sets of scores are acceptably similar.
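The calculation can be sketched as follows. The exact pooling formula is an assumption here: the sketch divides the difference in means by the square root of the average of the two sample variances, a common choice when both raters score the same set of responses. The function name is hypothetical.

```python
import math
import statistics


def standardized_mean_difference(rater1, rater2):
    """(mean of Rater 1 - mean of Rater 2) divided by the pooled standard deviation.

    Assumes both lists cover the same responses, so the two variances are
    weighted equally when pooling.
    """
    mean_diff = statistics.mean(rater1) - statistics.mean(rater2)
    pooled_sd = math.sqrt(
        (statistics.variance(rater1) + statistics.variance(rater2)) / 2
    )
    if pooled_sd == 0:
        raise ValueError("SMD is undefined when neither rater's scores vary")
    return mean_diff / pooled_sd
```

A positive value means Rater 1 scored higher on average, a negative value means Rater 2 did, and a value near 0 indicates that neither rater is systematically more lenient or strict.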

Spearman’s Rank-Order Correlation (ρ)

This measure indicates the strength of the association between two variables: the STAMP level assigned by Rater 1 and the level assigned by Rater 2. If raters are well-trained and understand the rating rubric, we expect both raters to assign similar levels—meaning the scores should move together. In other words, when Rater 1 assigns a high level, Rater 2 should also assign a high level, reflecting consistent evaluation of the same construct.

We use Spearman’s rank-order correlation coefficient instead of Pearson’s because Spearman’s is better suited for ordinal data, like STAMP proficiency levels. A correlation coefficient of 0.80 or above is considered strong in most fields (Akoglu, 2018).
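As a sketch of what the coefficient measures, the code below converts each rater's levels to ranks (tied levels share the average of their positions) and then computes the ordinary Pearson correlation of those ranks, which is the usual definition of Spearman's ρ. The helper names are hypothetical, and in practice a statistics library would normally be used instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman_rho(rater1, rater2):
    """Spearman's rank-order correlation between two raters' levels."""
    return pearson(average_ranks(rater1), average_ranks(rater2))
```

Because STAMP levels take only nine values, ties are frequent, which is why the ranking step averages tied positions rather than breaking ties arbitrarily.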

2 STAMP Levels Apart

This measure, expressed as a percentage, shows how often two ratings for the same response differ by 2 STAMP levels (e.g., Rater 1 assigns STAMP level 4 and Rater 2 assigns STAMP level 6).
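A minimal sketch of this count is shown below. It assumes the measure counts pairs that differ by exactly 2 levels, as in the example above; the function name and sample ratings are illustrative only.

```python
def share_two_levels_apart(rater1, rater2):
    """Fraction of responses whose two ratings differ by exactly 2 levels."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("need two non-empty, equal-length rating lists")
    hits = sum(abs(a - b) == 2 for a, b in zip(rater1, rater2))
    return hits / len(rater1)


# Hypothetical ratings: only the first pair (4 vs. 6) is two levels apart.
print(share_two_levels_apart([4, 5, 3, 6], [6, 5, 3, 5]))  # 0.25
```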

Overview Chart

Bar chart titled 'Avant STAMP Score Accuracy Overview' showing Exact and Exact + Adjacent Agreement percentages for STAMP 4S and STAMP WS writing and speaking. Writing: Exact = 86.6% / 94.9%, Exact + Adjacent = 99.6% / 99.7%. Speaking: Exact = 83.2% / 97%, Exact + Adjacent = 99.3% / 99.9%. Reference lines mark Acceptable (70–75%) and Desirable (80–90%) thresholds. — Chart showing the high accuracy of Avant Raters for the Writing and Speaking sections.

Detailed Score Statistics

We now focus on the quality of the ratings for the Writing and Speaking sections of STAMP 4S and STAMP WS, considering the statistics above across several representative languages. Below, we present results based on two different sets of comparisons:

Rater 1 vs Rater 2

We compare the STAMP level awarded by Rater 1 to the level awarded by Rater 2 across numerous responses rated by at least two raters. This comparison supports the reliability of ratings from two randomly assigned Avant raters. As noted earlier, two raters may agree on a score, but both could still be incorrect. Therefore, we do not include exact agreement measures between Rater 1 and Rater 2. Instead, we focus on Exact + Adjacent Agreement and report accuracy measures comparing scores from Rater 1 (who rates solo 80% of the time) with the official scores.

Rater 1 vs Official Score

To assess the accuracy of the levels assigned by Avant raters, we analyze instances where a response was rated by two or more raters. We compare the official score (derived from all individual ratings) to the score given by Rater 1 alone. This helps indicate how accurately a response is rated when only one rater is involved, which occurs 80% of the time.

Tables 1 and 2 present the statistical measures for the Writing and Speaking sections of five representative STAMP 4S languages.

Table 1 – Writing Score Accuracy (STAMP 4S)

Measure	Arabic	Spanish	French	Simplified Chinese	Russian
Number of Responses in Dataset	n = 3,703	n = 4,758	n = 4,785	n = 4,766	n = 3,536
Exact Agreement (Rater 1 vs. Official Score)	(84.8%)	(84.15%)	(83.66%)	(88.46%)	(92.17%)
Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	96.78% (98.62%)	99.09% (99.79%)	99.22% (99.79%)	99.79% (99.91%)	99.71% (99.88%)
Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.93 (0.96)	0.91 (0.95)	0.91 (0.95)	0.95 (0.96)	0.95 (0.97)
Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.00 (0.01)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.94 (0.96)	0.90 (0.95)	0.91 (0.95)	0.95 (0.97)	0.94 (0.97)
2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	2.80% (1.24%)	0.90% (0.20%)	0.77% (0.20%)	0.00% (0.00%)	0.28% (0.11%)

Table 1. Rater Reliability and Accuracy Statistics for the Writing Section of Five Representative STAMP 4S Languages.

Table 2 – Speaking Score Accuracy (STAMP 4S)

Measure	Arabic	Spanish	French	Simplified Chinese	Russian
Number of Responses in Dataset	n = 3,363	n = 4,078	n = 4,530	n = 4,651	n = 3,392
Exact Agreement (Rater 1 vs. Official Score)	(84.96%)	(80.37%)	(80.19%)	(82.24%)	(88.30%)
Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	96.07% (98.13%)	98.13% (99.29%)	98.54% (99.47%)	99.31% (99.76%)	98.99% (99.94%)
Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.92 (0.95)	0.92 (0.96)	0.91 (0.95)	0.94 (0.95)	0.92 (0.96)
Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	-0.02 (0.01)	0.00 (0.00)	-0.01 (0.02)	0.00 (0.00)	-0.01 (-0.01)
Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.93 (0.96)	0.91 (0.95)	0.92 (0.95)	0.94 (0.96)	0.91 (0.95)
2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	3.27% (1.42%)	1.74% (0.00%)	1.39% (0.00%)	0.00% (0.00%)	1.01% (0.00%)

Table 2. Rater Reliability and Accuracy Statistics for the Speaking Section of Five Representative STAMP 4S Languages.

Tables 3 and 4 show the statistical measures for the Writing and Speaking sections of three representative STAMP WS languages.

Table 3 – Writing Score Accuracy (STAMP WS)

Measure	Amharic	Haitian Creole	Vietnamese
Number of Responses in Dataset	n = 209	n = 125	n = 1,542
Exact Agreement (Rater 1 vs. Official Score)	(95.79%)	(94.69%)	(94.38%)
Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	99.52% (100%)	97.60% (100%)	98.57% (99.02%)
Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.98 (0.99)	0.97 (0.99)	0.96 (0.97)
Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	-0.01 (0.00)	0.02 (-0.02)	-0.01 (0.01)
Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.98 (0.99)	0.97 (0.99)	0.97 (0.98)
2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.00% (0.00%)	2.40% (0.00%)	0.00% (0.00%)

Table 3. Rater Reliability and Accuracy Statistics for the Writing Section of Three Representative STAMP WS Languages.

Table 4 – Speaking Score Accuracy (STAMP WS)

Measure	Amharic	Haitian Creole	Vietnamese
Number of Responses in Dataset	n = 225	n = 132	n = 1,180
Exact Agreement (Rater 1 vs. Official Score)	(96.21%)	(97.91%)	(97.01%)
Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	100% (100%)	100% (100%)	99.83% (99.83%)
Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.99 (0.99)	0.99 (0.99)	0.99 (0.98)
Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.00 (0.00)	0.00 (0.00)	0.00 (0.01)
Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.99 (0.99)	0.99 (0.99)	0.98 (0.99)
2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score)	0.00% (0.00%)	0.00% (0.00%)	0.00% (0.00%)

Table 4. Rater Reliability and Accuracy Statistics for the Speaking Section of Three Representative STAMP WS Languages.

Discussion

A high level of reliability and accuracy is fundamental to the validity of test scores and their intended uses. What is deemed minimally acceptable in terms of reliability and accuracy will, however, depend on the specific field (medicine, law, sports, forensics, language testing, etc.), on the consequences of awarding an inaccurate level to a specific examinee’s set of responses, and on the rating scale itself. For example, agreement tends to be lower the more categories a rating scale has. In other words, more disagreement between any two raters can be expected if they must assign one of ten possible levels to a response than if they must assign one of only four possible levels.

The statistics seen above for the Writing and Speaking sections of both STAMP 4S and STAMP WS show a high level of both reliability (Rater 1 vs. Rater 2 scores) and accuracy (Rater 1 vs. Official Scores).

Across the eight languages evaluated, the reliability shown by Exact + Adjacent Agreement between Rater 1 and Rater 2 is at least 96.78% for Writing and 96.07% for Speaking, and often considerably higher.

Additionally, cases in which the ratings by two raters were more than two STAMP levels apart were very seldom observed. The level of accuracy for all eight languages, as shown by the Exact Agreement statistics between Rater 1’s score and the Official Score for each response, is at least 83.66% for Writing and 80.19% for Speaking (and often considerably higher), with Exact + Adjacent Agreement at least 98.62% for Writing and 98.13% for Speaking. The Quadratic Weighted Kappa (QWK) values show a very high level of agreement both between Rater 1 and Rater 2 and between Rater 1 and the Official Scores, and the correlations between Rater 1 and Rater 2 scores, as well as between Rater 1 and Official Scores, are likewise very high. Finally, the Standardized Mean Difference (SMD) coefficients show that Avant raters use the STAMP scale in a very similar fashion.

The statistics above provide evidence of the high quality of the rater selection and training program at Avant Assessment and of our methodology for identifying operational raters who may need to be temporarily removed from the rater pool and given targeted training. They show that when two raters differ in the STAMP level assigned to a response, the difference will rarely be more than 1 STAMP level, with both raters assigning the exact same level in the great majority of cases.

Coupled with the fact that an examinee’s final, official score in either the Writing or Speaking section of STAMP is based on their individual STAMP scores across three independent prompts, these results provide strong evidence that an examinee’s final score for the Writing and Speaking sections of STAMP can be trusted to be a reliable and accurate representation of their level of language proficiency in these two domains.

References

Akoglu, H. (2018). User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.

Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests (Vol. 1). Oxford University Press.

Feldt, L. S., & Brennan, R. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.

Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions. 3rd ed. Wiley.

Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and Promoting Inter-Rater Agreement of Teacher and Principal Performance Ratings.

Matrix Education (2022). Physics Practical Skills Part 2: Validity, Reliability and Accuracy of Experiments. Retrieved on August 11, 2022.

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.

Updated: October 2025