Why can the final grade of an exam be unreliable?

Most educational systems include moments of transition or completion of studies that are marked by tests: to obtain an academic degree, to gain access to certain studies and educational centers, or to win awards and honorable mentions.

These tests have a great impact on people's academic, personal and professional futures, so they must be conducted with rigor and objectivity to guarantee fairness and to reward merit and ability.

The authentic assessment

In the last quarter of the last century an educational trend emerged proposing that these high-impact tests be authentic assessments. This type of test was designed as an alternative to the classic multiple-choice format, made up of questions with several options from which the test taker has to choose the correct one.

Instead, authentic assessment advocates using performance-based or open-ended tasks, such as written essays, oral presentations, artistic or physical performances, case resolution, reports, portfolios, or projects. This type of evaluation is understood to be more authentic than a multiple-choice test, hence its name.

It would have two fundamental advantages. The first is greater realism, since it makes it possible to recreate or simulate the real conditions of professional and academic performance. The second is the possibility of evaluating complex competences such as creativity, autonomy, critical capacity, argumentative organization or teamwork, which are difficult to fit into the multiple-choice format.

The canons of rigor

The idea is good. The problem comes when trying to make these authentic evaluations objective, because the road to hell is paved with good intentions. Objectivity and fairness are not negotiable, so authentic evaluation has to demonstrate that it meets the standards of evaluative rigor required for high-impact tests.

For this reason, as these types of evaluations became widespread, numerous studies were also carried out on the reliability of the ratings given by evaluators.

From the outset, the research revealed clear differences between the scores of evaluators who graded the same exercise.

In addition to differences in severity or leniency, it was also found that evaluators displayed a whole catalog of errors, inconsistencies and biases that affected their ratings.

Ultimately, it was confirmed that the scores were contaminated by the biases introduced by the evaluators. Some researchers warned that the limits on objective measurement in open-ended exercises could turn a high-impact test into a lottery. In other words, what was gained in realism was lost in objectivity.
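To make the "lottery" concern concrete, here is a minimal sketch in Python of how disagreement between two evaluators might be quantified. The scores, the function names and the cutoff are all invented for illustration; they do not come from any of the studies mentioned.

```python
from statistics import mean

# Hypothetical scores (0-10) given by two evaluators to the same six essays.
rater_a = [7, 5, 8, 6, 9, 4]
rater_b = [6, 5, 6, 5, 8, 2]

def severity_gap(a, b):
    """Mean of (a - b); a positive value means rater b is the more severe one."""
    return mean(x - y for x, y in zip(a, b))

def agreement_rate(a, b, tolerance=0):
    """Share of exercises whose two scores differ by at most `tolerance` points."""
    hits = sum(1 for x, y in zip(a, b) if abs(x - y) <= tolerance)
    return hits / len(a)

def pass_disagreements(a, b, cutoff):
    """Number of exercises that pass with one rater and fail with the other."""
    return sum(1 for x, y in zip(a, b) if (x >= cutoff) != (y >= cutoff))

print(severity_gap(rater_a, rater_b))
print(agreement_rate(rater_a, rater_b))
print(agreement_rate(rater_a, rater_b, tolerance=1))
print(pass_disagreements(rater_a, rater_b, cutoff=6))
```

The last function is the one that matters for a high-impact test: it counts the people whose result would change depending only on which evaluator they happened to get.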

Why do the evaluators disagree?

There are many reasons why two evaluators may disagree when rating the same exercise, although they can be grouped into three broad blocks:

  1. The subject matter of the evaluation. Review studies indicate that ratings of oral productions and written essays show lower levels of agreement between judges than ratings of physical and manual exercises, the solution of clinical cases or the execution of engineering projects. In school subjects, agreement between evaluators is greater in exams in scientific-mathematical areas than in linguistic areas, although this depends on the difficulty of the task.
  2. The evaluators themselves. There is abundant evidence of the effect that judges have on grades. Variations have been documented based on their personal characteristics, attitudes, emotional traits, professional trajectory, education, training and previous experience, and their behavior and cognitive processes when faced with the task, among others.
  3. The evaluation procedure used. To try to make their evaluations objective, experts use a set of criteria, instructions and guidelines called correction rubrics. Various studies have found that poorly specified rubrics introduce biases in the scores and widen the differences between evaluators.

Can the evaluation be objectified?

First of all, it must be accepted that in authentic assessment tests it is impossible to completely neutralize the subjectivity of the assessors. When high-impact evaluations include this type of task, one can only attempt to minimize the effect of the grader. How can this be done? There are four main alternatives:

  • Design correction rubrics that describe the evaluation process in an analytical and detailed way, and include scored examples of real performances in the training of evaluators.
  • Develop systematic training programs for evaluators in the use of rubrics in order to minimize the differences between them.
  • Set up evaluation panels in which the judges are distributed following a systematic pattern that compensates for the potential effect of assigning some exercises to more severe evaluators and others to more lenient ones.
  • Use psychometric models to estimate the scores of the people evaluated that include, in addition to the students' responses, the effect of the evaluators, so that the final grade is corrected according to the severity of the evaluating judges and panels.
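The last two alternatives can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the rotation scheme and the function names are invented, and the correction shown is a simple additive adjustment (subtracting each judge's average deviation from the overall mean), which is a much cruder stand-in for the psychometric models used in practice.

```python
from collections import defaultdict
from statistics import mean

def rotate_judges(exercises, judges, per_exercise=2):
    """Assign each exercise to consecutive judges, shifting the start each time."""
    n = len(judges)
    return {
        ex: [judges[(i + k) % n] for k in range(per_exercise)]
        for i, ex in enumerate(exercises)
    }

def severity_adjusted(ratings):
    """ratings: list of (exercise, judge, score). Returns {exercise: adjusted mean}."""
    grand = mean(score for _, _, score in ratings)
    by_judge = defaultdict(list)
    for _, judge, score in ratings:
        by_judge[judge].append(score)
    # Offset > 0: the judge scores above the overall mean (lenient); < 0: severe.
    offset = {judge: mean(scores) - grand for judge, scores in by_judge.items()}
    by_exercise = defaultdict(list)
    for ex, judge, score in ratings:
        by_exercise[ex].append(score - offset[judge])
    return {ex: mean(scores) for ex, scores in by_exercise.items()}

# Hypothetical data: judge A is consistently two points more generous than B.
plan = rotate_judges(["e1", "e2", "e3"], ["A", "B", "C"])
adjusted = severity_adjusted(
    [("e1", "A", 8), ("e1", "B", 6), ("e2", "A", 6), ("e2", "B", 4)]
)
print(plan)
print(adjusted)
```

The adjustment is only meaningful when judges have rated comparable sets of exercises, which is exactly what a systematic distribution of judges tries to ensure. If a severe judge only ever saw weak exercises, this simple method could not tell severity apart from quality.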


The idea of an authentic assessment that tries to capture all the richness and nuance of academic performance in situations that are as realistic as possible is highly commendable; the challenge is to find objective evaluation systems for this type of test. Otherwise we leave the people evaluated at the mercy of chance, defenselessness and inequity.