Overview
Accurate and reliable scores are essential in language proficiency testing. The Reading and Listening sections of STAMP are multiple-choice, which allows for automatic scoring. The Writing and Speaking sections elicit open-ended responses, which are scored by human raters.
Automatically Scored
Human Rated*
*This research used only human-scored responses, preceding Avant’s automated grading system.
For this reason, Avant is committed to making sure our rating accuracy and agreement between different raters are as high as possible.
A recent analysis of over 23,000 Writing and Speaking responses across five STAMP 4S languages (Arabic, Spanish, French, Simplified Chinese, and Russian) and three STAMP WS languages (Amharic, Haitian Creole, and Vietnamese) found that Avant’s raters demonstrated high scoring accuracy and inter-rater agreement, making the scores awarded in the Writing and Speaking sections of STAMP accurate and reliable for their intended purposes.
How Writing and Speaking Proficiency Are Scored
The study examined the Writing and Speaking sections of STAMP, which are scored by trained raters using STAMP levels from 0 (No Proficiency) to 8 (Advanced-Mid).
Examinees respond to three real-world prompts, showcasing their skills. Certified raters independently score each response, backed by rigorous training and ongoing monitoring to ensure accuracy and consistency.
For 80% of responses, a single rater’s score is official. For the remaining 20%, at least two raters score the response, with a manager resolving disagreements. Raters work independently, ensuring unbiased results. The final Writing or Speaking score seen in our reports reflects the highest level consistently demonstrated across at least two of the three prompts.
The figure below illustrates this process:
As shown in Figure 1, an examinee’s official STAMP level is determined by the highest level they can consistently demonstrate in at least two out of three responses. For example, if an examinee receives Novice-Mid for their first response, Novice-High for their second, and Novice-High for their third, their final STAMP level is STAMP 3 (Novice-High). Alternatively, if they receive Intermediate-Low for the first response, Novice-High for the second, and Intermediate-Mid for the third, their final level is Intermediate-Low, as it is the highest level they sustained twice (in the first and third responses).
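The "highest level sustained in at least two of three responses" rule can be sketched in a few lines. This is an illustrative reading of the rule, not Avant's actual implementation: it assumes a level counts as sustained when at least two responses are rated at that level or higher, which makes the final level the middle value of the three ratings. The function name and the numeric mapping of the level names between 0, 3, and 8 are assumptions made for the example.

```python
from typing import Sequence

# STAMP levels by index. Levels 0, 3 and 8 are named in the text;
# the remaining names are assumed to follow the usual Novice/Intermediate/Advanced order.
STAMP_LEVELS = [
    "No Proficiency", "Novice-Low", "Novice-Mid", "Novice-High",
    "Intermediate-Low", "Intermediate-Mid", "Intermediate-High",
    "Advanced-Low", "Advanced-Mid",
]


def final_stamp_level(prompt_levels: Sequence[int]) -> int:
    """Return the highest level reached (or exceeded) in at least two of three responses."""
    if len(prompt_levels) != 3:
        raise ValueError("exactly three prompt ratings are required")
    if any(level < 0 or level > 8 for level in prompt_levels):
        raise ValueError("STAMP levels range from 0 to 8")
    # With three ratings, the highest level met by at least two of them
    # is the middle value of the sorted ratings.
    return sorted(prompt_levels)[1]


# The two examples from the text:
first = final_stamp_level([2, 3, 3])   # Novice-Mid, Novice-High, Novice-High
second = final_stamp_level([4, 3, 5])  # Intermediate-Low, Novice-High, Intermediate-Mid
print(STAMP_LEVELS[first], "/", STAMP_LEVELS[second])
```

Under this reading, a single unusually high or low rating on one prompt cannot move the final level on its own.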
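Using three independent prompts in the Writing and Speaking sections of STAMP has two main benefits:
- Broader topic coverage: Assessing examinees across different topics makes it more likely that the awarded proficiency level generalizes to other real-world situations.
- Minimized rater bias: Combined with the scoring method, using multiple prompts helps reduce the scoring bias that individual raters may introduce.
Next, we discuss the definitions of reliability and accuracy.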
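Reliability vs. Accuracy
Reliability
Reliability refers to the consistency of measurement (Bachman & Palmer, 1996). Simply put, it is the degree to which we can trust that test scores would remain the same if an examinee took the test again at a different time, or took a different version of the test, assuming their ability had not changed.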
For example, if an examinee scores Intermediate-Low today and Intermediate-High tomorrow, without any change in their knowledge or mental state, it suggests the test may not be highly reliable. Similarly, if an examinee scores Advanced-Low on one version of a test and Intermediate-Mid on another, it indicates a lack of consistency, pointing to an issue with the test’s reliability.
One factor contributing to a test’s reliability is how it is scored. In the STAMP test, the Reading and Listening sections are made up of multiple-choice questions that are scored automatically by a computer. This ensures that if an examinee provides the same answers on different occasions, they will always receive the same score.
The Writing and Speaking sections, by contrast, are scored by human raters. This means that scores can vary depending on who rates the response. With well-trained raters, however, we expect score variations to be minimal, reducing the impact of leniency, strictness, or potential bias.
Accuracy
Examinees expect their scores to reflect only their proficiency in the construct being measured (in STAMP, proficiency in each language domain).
Accuracy refers to how well the awarded score represents an examinee’s true ability. For example, if an examinee submits a Speaking response at the Intermediate-High level but receives an Intermediate-Low score from two raters, the score is inaccurate. Even if two other raters assign Intermediate-Low two months later, the score remains inaccurate, although it is reliable (since it is consistent across raters and over time).
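Evaluating the Reliability and Accuracy of Rater Scores
When responses are scored by human raters, as is the case with STAMP, it is essential that scores reflect the quality of the response itself rather than the characteristics of the rater. In other words, the score should depend solely on the proficiency the examinee demonstrates, not on a rater's leniency, strictness, or bias.
Language test providers often use statistics to show how much score variation is attributable to differences between raters. Typically, this involves comparing the ratings that two different raters assign to the same response. Ideally, raters agree as often as possible, which indicates that the rating process is reliable.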
However, reliability must also be accompanied by accuracy. Two raters may assign the same score, but both could be incorrect. In a well-developed test, the goal is for raters to consistently agree and be accurate in their scoring.
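Perfect agreement between human raters is not always realistic. Despite training and expertise, even qualified raters will sometimes disagree, just as doctors, engineers, or scientists do. Our goal is to achieve a high level of agreement that is defensible given the intended uses of the scores.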
Below are the statistical measures we use at Avant to evaluate the quality of ratings provided by our raters. While many companies report only exact and adjacent agreement, we assess additional measures to get a comprehensive view of rating quality. The measures reported in this paper include:
Exact Agreement:
This measure is reported as a percentage and shows how often, across the entire dataset analyzed, Rater 1 and Rater 2 assigned exactly the same level to a given response. For example, if Rater 1 assigns STAMP level 5 to a response and Rater 2 also assigns STAMP level 5 to that response, this counts as a case of exact agreement. Feldt and Brennan (1989) suggest that exact agreement should be at least 80% when two raters are used, with 70% considered acceptable in operational practice.
This same measure can also be used to compare the score assigned by Rater 1 to the official score a response receives after being rated by at least two raters. This is the case employed in the Overview Chart below.
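As a minimal sketch of how exact agreement can be computed from paired ratings (the function name and the sample ratings are invented for illustration and are not taken from the study data):

```python
from typing import Sequence


def exact_agreement(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Percentage of responses that received the same level in both score lists."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return 100.0 * matches / len(scores_a)


# Hypothetical ratings for five responses: Rater 1 vs. the official score.
rater_1 = [5, 4, 6, 3, 5]
official = [5, 4, 5, 3, 5]
print(f"{exact_agreement(rater_1, official):.2f}%")  # 4 of 5 responses match
```

The same function works for either comparison: Rater 1 against Rater 2, or Rater 1 against the official score.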
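Exact + Adjacent Agreement:
This measure is reported as a percentage and shows how often, across the entire dataset, Rater 1 and Rater 2 assigned the same or an adjacent level to a given response.
For example, STAMP level 5 is adjacent to levels 4 and 6. If Rater 1 assigns level 4 and Rater 2 assigns level 5, the pair counts toward this measure because the two levels are adjacent. According to Graham et al. (2012), when a rating scale has more than 5 to 7 levels, as the STAMP scale does, exact + adjacent agreement should be close to 90%.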
This same measure can also be used to compare the score assigned by Rater 1 to the official score a response receives after being rated by at least two raters. This is the case employed in the Overview Chart below.
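Exact + adjacent agreement relaxes the comparison to "within one level". The sketch below uses a hypothetical function name and invented ratings:

```python
from typing import Sequence


def exact_plus_adjacent_agreement(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Percentage of responses whose two ratings are identical or one level apart."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both score lists must be non-empty and of equal length")
    within_one = sum(abs(a - b) <= 1 for a, b in zip(scores_a, scores_b))
    return 100.0 * within_one / len(scores_a)


# Hypothetical ratings: the third response is rated two levels apart (6 vs. 4),
# so it is the only one that does not count.
rater_1 = [5, 4, 6, 3, 5]
rater_2 = [5, 5, 4, 3, 6]
print(f"{exact_plus_adjacent_agreement(rater_1, rater_2):.2f}%")
```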
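Quadratic Weighted Kappa (QWK)
Cohen's kappa (𝜅) measures the reliability between two raters while accounting for the possibility of agreement by chance. For example, on the 9-point STAMP scale (levels 0 to 8), the probability that two raters agree on a score purely by chance is 11.11% (1 in 9). At Avant, we also apply quadratic weights when computing kappa, meaning that larger differences between scores are penalized more heavily. For example, a discrepancy between STAMP level 3 and level 7 is more problematic than a discrepancy between level 3 and level 4.
Williamson et al. (2012) recommend a quadratic weighted kappa (QWK) of ≥ 0.70, while Fleiss et al. (2003) note that values above 0.75 indicate excellent agreement beyond chance. A QWK of 0 means agreement is entirely due to chance, whereas a value of 1 indicates perfect agreement.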
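To make the weighting concrete, here is a sketch of the standard quadratic weighted kappa formula in plain Python: one minus the ratio of weighted observed disagreement to the weighted disagreement expected by chance, with weights (i - j)^2 / (k - 1)^2 for k levels. The function name and sample ratings are illustrative; the source does not say which software Avant uses.

```python
from typing import Sequence


def quadratic_weighted_kappa(rater_a: Sequence[int], rater_b: Sequence[int],
                             num_levels: int = 9) -> float:
    """Cohen's kappa with quadratic weights for integer levels 0..num_levels-1."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both score lists must be non-empty and of equal length")
    # Observed confusion matrix: observed[i][j] counts responses rated i by A and j by B.
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    observed_penalty = 0.0
    expected_penalty = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2  # grows with the squared gap
            observed_penalty += weight * observed[i][j]
            # Count expected by chance if the two raters were independent.
            expected_penalty += weight * row_totals[i] * col_totals[j] / n
    if expected_penalty == 0:
        raise ValueError("kappa is undefined when both raters use one identical level")
    return 1.0 - observed_penalty / expected_penalty


base = [3, 4, 5, 6]
near_miss = quadratic_weighted_kappa(base, [3, 4, 5, 7])  # one rating off by 1 level
far_miss = quadratic_weighted_kappa(base, [7, 4, 5, 6])   # one rating off by 4 levels
print(round(near_miss, 3), round(far_miss, 3))
```

In this toy example a single four-level discrepancy lowers kappa far more than a single one-level discrepancy, which is the behavior the quadratic weights are meant to produce.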
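Standardized Mean Difference (SMD)
This measure shows how similarly two raters use the rating scale. It compares the difference between the means of two sets of scores (Rater 1 vs. Rater 2), standardized by the pooled standard deviation of those scores. Ideally, neither rater should favor or avoid particular levels on the scale (for example, avoiding STAMP 0 or STAMP 8). In other words, both raters should use the full range of the scale (STAMP 0 to STAMP 8), and scores should reflect the proficiency demonstrated in the response. The recommended value for this measure is ≤ 0.15 (Williamson et al., 2012), indicating that the two score distributions are acceptably similar.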
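A minimal sketch of the SMD calculation follows. It assumes the pooled standard deviation is the square root of the average of the two sample variances and that the sign is Rater 1 minus Rater 2; the source does not spell out either detail, and the function name and ratings are invented for illustration.

```python
import math
import statistics
from typing import Sequence


def standardized_mean_difference(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Difference in mean level (A minus B) divided by the pooled standard deviation."""
    # Assumed pooling: square root of the mean of the two sample variances.
    pooled_sd = math.sqrt((statistics.variance(scores_a) + statistics.variance(scores_b)) / 2)
    if pooled_sd == 0:
        raise ValueError("SMD is undefined when neither rater's scores vary")
    return (statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_sd


# Hypothetical ratings: Rater 1 is slightly more lenient on one response.
rater_1 = [3, 4, 5, 6, 4, 5]
rater_2 = [3, 4, 5, 5, 4, 5]
smd = standardized_mean_difference(rater_1, rater_2)
print(f"SMD = {smd:.2f}; within 0.15: {abs(smd) <= 0.15}")
```

A value near zero means the two raters' averages are close relative to the spread of the scores; the absolute value is what is compared with the 0.15 guideline.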
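Spearman’s Rank-Order Correlation (ρ)
This measure shows the strength of the association between two variables: the STAMP level assigned by Rater 1 and the level assigned by Rater 2. If raters are well trained and understand the rating rubric, we would expect both raters to assign similar levels, meaning that the scores should move together. In other words, when Rater 1 assigns a high level, Rater 2 should also assign a high level, reflecting a consistent evaluation of the same construct.
We use Spearman’s rank-order correlation rather than Pearson’s correlation because Spearman’s coefficient is better suited to ordinal data, such as STAMP proficiency levels. In most fields, a correlation of 0.80 or above is considered strong (Akoglu, 2018).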
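Spearman's coefficient is Pearson's correlation applied to ranks rather than raw values. The sketch below computes it in plain Python, assigning tied levels the average of their rank positions (ties are unavoidable on a 9-level scale). Function names and ratings are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[int]) -> List[float]:
    """1-based ranks; tied values share the average of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman_rho(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Pearson correlation computed on the average ranks of the two score lists."""
    ranks_a, ranks_b = average_ranks(scores_a), average_ranks(scores_b)
    n = len(ranks_a)
    mean_a, mean_b = sum(ranks_a) / n, sum(ranks_b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(ranks_a, ranks_b))
    spread_a = sum((x - mean_a) ** 2 for x in ranks_a)
    spread_b = sum((y - mean_b) ** 2 for y in ranks_b)
    if spread_a == 0 or spread_b == 0:
        raise ValueError("correlation is undefined when a rater assigns a single level")
    return covariance / math.sqrt(spread_a * spread_b)


# Hypothetical ratings: Rater 2 is consistently one level higher, yet the
# rank order is identical, so the correlation is perfect.
print(spearman_rho([2, 3, 5, 7], [3, 4, 6, 8]))
```

The example also shows why correlation alone is not enough: a rater who is systematically one level more lenient still correlates perfectly, which is why it is reported alongside agreement and SMD.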
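2 STAMP Levels Apart
This measure is expressed as a percentage and shows how often two ratings of the same response differ by two STAMP levels (for example, Rater 1 assigns STAMP level 4 while Rater 2 assigns STAMP level 6).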
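One way to obtain this figure, sketched below with a hypothetical function name and invented ratings, is to tabulate the absolute gap between the two ratings of every response. The entry for a gap of 2 is the "2 STAMP levels apart" percentage, read here as exactly two levels.

```python
from collections import Counter
from typing import Dict, Sequence


def gap_distribution(scores_a: Sequence[int], scores_b: Sequence[int]) -> Dict[int, float]:
    """Percentage of responses at each absolute gap between the two ratings."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both score lists must be non-empty and of equal length")
    counts = Counter(abs(a - b) for a, b in zip(scores_a, scores_b))
    total = len(scores_a)
    return {gap: 100.0 * count / total for gap, count in sorted(counts.items())}


# Hypothetical ratings for four responses.
rater_1 = [4, 5, 3, 6]
rater_2 = [6, 5, 3, 7]
distribution = gap_distribution(rater_1, rater_2)
print(distribution.get(2, 0.0))  # share of responses rated exactly two levels apart
```

Summing the entries for gaps 0 and 1 gives the exact + adjacent agreement for the same data.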
Detailed Rating Statistics
We now focus on the quality of the ratings for the Writing and Speaking sections of STAMP 4S and STAMP WS, considering the statistics above across several representative languages. Below, we present results based on two different sets of comparisons:
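Rater 1 vs. Rater 2
Among the many responses rated by at least two raters, we compared the STAMP level assigned by Rater 1 with the STAMP level assigned by Rater 2. This comparison demonstrates the reliability of the ratings given by two randomly assigned Avant raters. As noted earlier, two raters may agree on a score and yet both be incorrect. For this reason, we do not include the exact agreement measure between Rater 1 and Rater 2. Instead, we focus on exact + adjacent agreement, and we compare the rating of Rater 1 (who rates alone 80% of the time) with the official score to report accuracy.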
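Rater 1 vs. Official Score
To evaluate the accuracy of the scores assigned by Avant raters, we analyzed cases in which a response was rated by two or more raters. We compared the official score (derived from all individual ratings) with the score assigned by Rater 1 alone. This helps show how accurate the score for a response is when only one rater is involved, which is the case 80% of the time.
Tables 1 and 2 present the statistical measures for the Writing and Speaking sections of five representative STAMP 4S languages.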
Table 1 – Writing Score Accuracy (STAMP 4S)
| Measure | Arabic | Spanish | French | Simplified Chinese | Russian |
|---|---|---|---|---|---|
| Number of responses in dataset | n = 3,703 | n = 4,758 | n = 4,785 | n = 4,766 | n = 3,536 |
| Exact Agreement (Rater 1 vs. Official Score) | (84.8%) | (84.15%) | (83.66%) | (88.46%) | (92.17%) |
| Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 96.78% (98.62%) | 99.09% (99.79%) | 99.22% (99.79%) | 99.79% (99.91%) | 99.71% (99.88%) |
| Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.93 (0.96) | 0.91 (0.95) | 0.91 (0.95) | 0.95 (0.96) | 0.95 (0.97) |
| Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.00 (0.01) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.00) |
| Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.94 (0.96) | 0.90 (0.95) | 0.91 (0.95) | 0.95 (0.97) | 0.94 (0.97) |
| 2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 2.80% (1.24%) | 0.90% (0.20%) | 0.77% (0.20%) | 0.00% (0.00%) | 0.28% (0.11%) |
Table 2 – Speaking Score Accuracy (STAMP 4S)
| Measure | Arabic | Spanish | French | Simplified Chinese | Russian |
|---|---|---|---|---|---|
| Number of responses in dataset | n = 3,363 | n = 4,078 | n = 4,530 | n = 4,651 | n = 3,392 |
| Exact Agreement (Rater 1 vs. Official Score) | (84.96%) | (80.37%) | (80.19%) | (82.24%) | (88.30%) |
| Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 96.07% (98.13%) | 98.13% (99.29%) | 98.54% (99.47%) | 99.31% (99.76%) | 98.99% (99.94%) |
| Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.92 (0.95) | 0.92 (0.96) | 0.91 (0.95) | 0.94 (0.95) | 0.92 (0.96) |
| Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | -0.02 (0.01) | 0.00 (0.00) | -0.01 (0.02) | 0.00 (0.00) | -0.01 (-0.01) |
| Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.93 (0.96) | 0.91 (0.95) | 0.92 (0.95) | 0.94 (0.96) | 0.91 (0.95) |
| 2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 3.27% (1.42%) | 1.74% (0.00%) | 1.39% (0.00%) | 0.00% (0.00%) | 1.01% (0.00%) |
Tables 3 and 4 present the statistical measures for the Writing and Speaking sections of three representative STAMP WS languages.
Table 3 – Writing Score Accuracy (STAMP WS)
| Measure | Amharic | Haitian Creole | Vietnamese |
|---|---|---|---|
| Number of responses in dataset | n = 209 | n = 125 | n = 1,542 |
| Exact Agreement (Rater 1 vs. Official Score) | (95.79%) | (94.69%) | (94.38%) |
| Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 99.52% (100%) | 97.60% (100%) | 98.57% (99.02%) |
| Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.98 (0.99) | 0.97 (0.99) | 0.96 (0.97) |
| Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | -0.01 (0.00) | 0.02 (-0.02) | -0.01 (0.01) |
| Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.98 (0.99) | 0.97 (0.99) | 0.97 (0.98) |
| 2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.00% (0.00%) | 2.40% (0.00%) | 0.00% (0.00%) |

Table 4 – Speaking Score Accuracy (STAMP WS)
| Measure | Amharic | Haitian Creole | Vietnamese |
|---|---|---|---|
| Number of responses in dataset | n = 225 | n = 132 | n = 1,180 |
| Exact Agreement (Rater 1 vs. Official Score) | (96.21%) | (97.91%) | (97.01%) |
| Exact + Adjacent Agreement: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 100% (100%) | 100% (100%) | 99.83% (99.83%) |
| Quadratic Weighted Kappa (QWK): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.99 (0.99) | 0.99 (0.99) | 0.99 (0.98) |
| Standardized Mean Difference (SMD): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.00 (0.00) | 0.00 (0.00) | 0.00 (0.01) |
| Spearman’s Rank-Order Correlation (ρ): Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.99 (0.99) | 0.99 (0.99) | 0.98 (0.99) |
| 2 STAMP Levels Apart: Rater 1 vs. Rater 2 (Rater 1 vs. Official Score) | 0.00% (0.00%) | 0.00% (0.00%) | 0.00% (0.00%) |

Discussion
A high level of reliability and accuracy is fundamental to the validity of test scores and their intended uses. What is deemed minimally acceptable in terms of reliability and accuracy will, however, depend on the specific field (medicine, law, sports, forensics, language testing, etc.), on the consequences of awarding an inaccurate level to a specific examinee’s set of responses, and on the rating scale itself. For example, agreement tends to be lower the more categories a rating scale has. In other words, more disagreement between any two raters can be expected if they must assign one of ten possible levels to a response than if they must assign one of only four possible levels.
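The statistics above for the Writing and Speaking sections of STAMP 4S and STAMP WS show high levels of both reliability (Rater 1 vs. Rater 2 scores) and accuracy (Rater 1 vs. official scores).
Across the eight languages evaluated, reliability as shown by exact + adjacent agreement between Rater 1 and Rater 2 was never lower than 96.78% for Writing and 96.07% for Speaking, and was often higher.
Moreover, cases in which two raters' scores differed by two or more STAMP levels were rare. Judging by the exact agreement statistics between Rater 1's scores and the official scores, accuracy was high for all eight languages: exact agreement was at least 83.66% for Writing (and often higher) and at least 80.19% for Speaking, while exact + adjacent agreement was at least 98.62% for Writing and at least 98.13% for Speaking. The quadratic weighted kappa (QWK) values show very high agreement both between Rater 1 and Rater 2 and between Rater 1 and the official scores, and the Spearman correlations for the same two comparisons are likewise very high. Finally, the standardized mean difference (SMD) values indicate that Avant raters use the STAMP scale in a very similar way.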
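The reported minimums can be checked directly against Tables 1 to 4. The sketch below copies the Rater 1 vs. Rater 2 exact + adjacent agreement values from those tables and compares them with the roughly 90% guideline attributed to Graham et al. (2012); the variable and function names are illustrative only.

```python
from typing import Dict, Tuple

GRAHAM_2012_GUIDELINE = 90.0  # approximate target for exact + adjacent agreement (%)

# Exact + adjacent agreement, Rater 1 vs. Rater 2, copied from Tables 1-4.
writing: Dict[str, float] = {
    "Arabic": 96.78, "Spanish": 99.09, "French": 99.22, "Simplified Chinese": 99.79,
    "Russian": 99.71, "Amharic": 99.52, "Haitian Creole": 97.60, "Vietnamese": 98.57,
}
speaking: Dict[str, float] = {
    "Arabic": 96.07, "Spanish": 98.13, "French": 98.54, "Simplified Chinese": 99.31,
    "Russian": 98.99, "Amharic": 100.0, "Haitian Creole": 100.0, "Vietnamese": 99.83,
}


def lowest(values: Dict[str, float]) -> Tuple[str, float]:
    """Return the language with the lowest agreement and its value."""
    language = min(values, key=values.get)
    return language, values[language]


for section, values in (("Writing", writing), ("Speaking", speaking)):
    language, value = lowest(values)
    meets_guideline = all(v >= GRAHAM_2012_GUIDELINE for v in values.values())
    print(f"{section}: lowest is {language} at {value}%; all above guideline: {meets_guideline}")
```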
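The statistics above attest to the high quality of Avant Assessment's rater selection and training program, as well as to the methods we use to identify operational raters who may need to be temporarily removed from the rater pool and given targeted training. They show that when any two raters differ in the STAMP level they assign to a response, the difference rarely exceeds one STAMP level, and in the vast majority of cases both raters assign exactly the same level. In addition, an examinee's final official score in the Writing or Speaking section of STAMP is based on their STAMP levels across three independent prompts.
Taken together, these results provide strong evidence that an examinee's final scores in the Writing and Speaking sections of STAMP reliably and accurately reflect their language proficiency in these two domains.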
References
Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine, 18(3), 91-93.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice: Designing and developing useful language tests (Vol. 1). Oxford University Press.
Feldt, L. S., & Brennan, R. (1989). Reliability. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 105-146). New York: Macmillan.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). Wiley.
Graham, M., Milanowski, A., & Miller, J. (2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings.
Matrix Education (2022). Physics Practical Skills Part 2: Validity, Reliability and Accuracy of Experiments. Retrieved August 11, 2022.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.

