Performance Assessment in Integrated Knowledge Projects (PIS): Rubric Design and Inter‑Rater Reliability

Authors

Rocafuerte López, M. J., Quezada Sanmartin, F. A., Torres Aguilar, N. S., & Mejía Ruiz, E. N.

DOI:

https://doi.org/10.64747/xeasmv38

Keywords:

analytical rubrics, inter‑rater reliability, Gwet’s AC1, ICC, generalizability theory

Abstract

Objective: To examine the inter‑rater reliability of an analytic rubric for Integrated Knowledge Projects (PIS) in 12th grade and to outline design decisions for formative and summative uses.

Methods: Observational study in public urban schools in Guayaquil (−2.170997°, −79.922359°). A total of 400 student products (reports, presentations, prototypes) were scored by 4–6 raters per product using a six‑criterion, four‑level rubric. A pilot (n≈40) informed descriptor refinement and variance estimation. The protocol comprised rater training and calibration against an anchored exemplar bank. Analyses included Gwet’s AC1 per criterion (with 95% CIs), ICC(2,k) for the total score (0–100), mixed‑effects models for product‑type comparisons, and, in a subsample (n≈120), generalizability studies (G and phi coefficients), plus sensitivity checks (re‑weighting, adjudication exclusion, and re‑calibration).

Results: AC1 ranged from 0.72 to 0.84, higher for technical criteria (Evidence/Data; Rigor) and lower for interpretive dimensions (Collaboration; Ethical Impact and Feasibility). The overall ICC(2,k) reached 0.88 (mean k≈4.8; ICC(2,1)=0.69); the reliability‑by‑k curve showed diminishing returns beyond k=5, with an operational “sweet spot” at k=4–5. The p×i×r subsample yielded G=0.86 and phi=0.82; most error variance was attributable to product×rater interactions. Prototypes scored lower than reports and presentations, especially on Communication and Impact. Sensitivity analyses supported metric stability and showed reduced rater variance after re‑calibration.

Conclusions: The PIS‑BGU rubric demonstrates good‑to‑excellent reliability for the total score and moderate‑to‑high reliability by criterion, supporting both summative decisions and formative feedback. Recommendations include using k=4–5 raters, protocolized adjudication when discrepancies exceed one rubric level, and strengthened calibration for interpretive criteria. Future work will incorporate models of rater severity and external validation in rural settings and other STEAM areas.
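
For the per‑criterion agreement analysis, Gwet’s AC1 can be computed from per‑product counts of how many raters assigned each rubric level. The sketch below is a minimal illustration in Python (NumPy only); the function name and input layout are assumptions for exposition, not the study’s own code.

    import numpy as np

    def gwet_ac1(counts):
        """Gwet's AC1 for multiple raters with a variable number of raters per item.

        counts: (n_products, q_levels) array, where counts[i, k] is the number of
        raters who placed product i at rubric level k (here 4-6 raters per product).
        Assumes every product received at least one rating.
        """
        counts = np.asarray(counts, dtype=float)
        r_i = counts.sum(axis=1)                      # raters per product
        usable = r_i >= 2                             # agreement needs >= 2 ratings
        c, r = counts[usable], r_i[usable]
        pa = ((c * (c - 1)).sum(axis=1) / (r * (r - 1))).mean()  # observed agreement
        pi_k = (counts / r_i[:, None]).mean(axis=0)   # mean prevalence of each level
        q = counts.shape[1]
        pe = (pi_k * (1 - pi_k)).sum() / (q - 1)      # AC1 chance-agreement term
        return (pa - pe) / (1 - pe)

The 95% CIs reported per criterion could be obtained, for example, by bootstrapping over products (resampling rows of counts); the abstract does not state which variance estimator the authors used.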
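
The total‑score reliability follows the two‑way random‑effects model: ICC(2,1) for a single rater and ICC(2,k) for the mean of k raters (Shrout & Fleiss, 1979; McGraw & Wong, 1996). The sketch below assumes a complete products × raters score matrix; since the study’s design was unbalanced (4–6 raters per product), treat it as illustrative only.

    import numpy as np

    def icc2(scores):
        """ICC(2,1) and ICC(2,k) from a complete (n_products, k_raters) matrix."""
        X = np.asarray(scores, dtype=float)
        n, k = X.shape
        grand = X.mean()
        ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()  # between products
        ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()  # between raters
        ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols
        msr = ss_rows / (n - 1)
        msc = ss_cols / (k - 1)
        mse = ss_err / ((n - 1) * (k - 1))
        icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
        icc_mean = (msr - mse) / (msr + (msc - mse) / n)
        return icc_single, icc_mean

The reliability‑by‑k curve can be approximated from the single‑rater value via the Spearman‑Brown projection; with ICC(2,1) = 0.69 this gives roughly 0.82 (k=2), 0.87 (k=3), 0.90 (k=4), 0.92 (k=5), and 0.93 (k=6), consistent with the reported diminishing returns beyond k=5.

    def spearman_brown(rho_single, k):
        """Projected reliability of the mean of k raters, from single-rater reliability."""
        return k * rho_single / (1 + (k - 1) * rho_single)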
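
In the p×i×r generalizability analysis, the G (relative) and phi (absolute) coefficients follow directly from the estimated variance components (Brennan, 2001). The decision‑study sketch below assumes the components have already been estimated elsewhere; the dictionary keys are illustrative labels for the standard facets, not identifiers from the study.

    def g_and_phi(vc, n_i, n_r):
        """Relative (G) and absolute (phi) coefficients for a crossed p x i x r design.

        vc: dict of estimated variance components, e.g.
            {'p': ..., 'i': ..., 'r': ..., 'pi': ..., 'pr': ..., 'ir': ..., 'pir_e': ...}
        n_i: number of criteria (items) in the decision study.
        n_r: number of raters in the decision study.
        """
        rel_err = vc['pi'] / n_i + vc['pr'] / n_r + vc['pir_e'] / (n_i * n_r)
        abs_err = rel_err + vc['i'] / n_i + vc['r'] / n_r + vc['ir'] / (n_i * n_r)
        g = vc['p'] / (vc['p'] + rel_err)
        phi = vc['p'] / (vc['p'] + abs_err)
        return g, phi

Because most error variance was located in the product×rater interaction, increasing n_r shrinks the dominant error terms fastest, which is consistent with the recommendation of k=4–5 raters.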

References

Brennan, R. L. (2001). Generalizability theory. Springer. https://doi.org/10.1007/978-1-4757-3456-0

Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558. https://doi.org/10.1016/0895-4356(90)90159-M

De la Cruz Cruz, M. R., Lara Jerez, B. O., Almeida, M. E., & Mafla Álvarez, A. M. (2025). Evaluación formativa con rúbricas analíticas en resolución de problemas [Formative assessment with analytic rubrics in problem solving]. Horizonte Científico International Journal, 3(2), 1–18. https://doi.org/10.64747/a93zv304

Duarte Ortiz, J. del C., Gordillo Ronquillo, A. M., Orellana Romero, B. P., & Vera Letechi, J. E. (2025). Tecnología, modelos pedagógicos y desempeño académico: análisis en instituciones educativas de Loja y Guayaquil [Technology, pedagogical models, and academic performance: An analysis in educational institutions in Loja and Guayaquil]. Horizonte Científico International Journal, 3(2), 1–14. https://doi.org/10.64747/aj9hhg57

Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549. https://doi.org/10.1016/0895-4356(90)90158-L

Hallgren, K. A. (2012). Computing inter-rater reliability for observational data: An overview and tutorial. Tutorials in Quantitative Methods for Psychology, 8(1), 23–34. https://doi.org/10.20982/tqmp.08.1.p023

Konstantinidis, M., Potamias, G., Karampelas, P., & Fotiadis, D. I. (2022). An empirical comparative assessment of inter-rater agreement measures. Symmetry, 14(2), 262. https://doi.org/10.3390/sym14020262

Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155–163. https://doi.org/10.1016/j.jcm.2016.02.012

McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46. https://doi.org/10.1037/1082-989X.1.1.30

Montero Anzuat, C. A., Montezuma Monar, R. B., Valdiviezo Puchaicela, F. E., & Yar Pilamunga, G. J. (2025). Resolución de problemas contextualizado mediante modelación matemática: efectos en el pensamiento crítico y la transferencia cognitiva [Contextualized problem solving through mathematical modeling: Effects on critical thinking and cognitive transfer]. Horizonte Científico Educativo International Journal, 1(2), 1–12. https://doi.org/10.64747/dj2m7h71

Ohyama, T. (2021). Statistical inference of Gwet’s AC1 coefficient for multiple raters and binary outcomes. Communications in Statistics—Theory and Methods, 50(14), 3564–3572. https://doi.org/10.1080/03610926.2019.1708397

Rodríguez Ruiz, M. F., & Posligua Garcia, D. M. (2025). Evaluación formativa con rúbricas digitales en Ciencias Naturales: impacto en aprendizaje por indagación en 7.º–10.º EGB [Formative assessment with digital rubrics in Natural Sciences: Impact on inquiry-based learning in grades 7–10 of EGB]. Horizonte Científico Educativo International Journal, 1(2), 1–15. https://doi.org/10.64747/emgnq411

Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428. https://doi.org/10.1037/0033-2909.86.2.420

Tan, K. S., Yeh, Y.-C., Adusumilli, P. S., & Travis, W. D. (2024). Quantifying interrater agreement and reliability between thoracic pathologists: Paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clinical and Research Reports, 5, 100618. https://doi.org/10.1016/j.jtocrr.2024.100618

Tong, F., Tang, S., Irby, B. J., Lara-Alecio, R., & Guerrero, C. (2020). The determination of appropriate coefficient indices for inter-rater reliability: Using classroom observation instruments as fidelity measures in large-scale randomized research. International Journal of Educational Research, 99, 101514. https://doi.org/10.1016/j.ijer.2019.101514

Vach, W., & Gerke, O. (2023). Gwet’s AC1 is not a substitute for Cohen’s kappa—A comparison of basic properties. MethodsX, 10, 102212. https://doi.org/10.1016/j.mex.2023.102212

Webb, N. M. (2014). Generalizability theory: Overview. In Wiley StatsRef: Statistics Reference Online. Wiley. https://doi.org/10.1002/9781118445112.stat06729

Xu, M., Li, Z., Mou, K., & Shuaib, K. M. (2023). Homogeneity test of the first-order agreement coefficient in a stratified design. Entropy, 25(3), 536. https://doi.org/10.3390/e25030536

Published

2025-12-29

How to Cite

Rocafuerte López, M. J., Quezada Sanmartin, F. A., Torres Aguilar, N. S., & Mejía Ruiz, E. N. (2025). Performance Assessment in Integrated Knowledge Projects (PIS): Rubric Design and Inter‑Rater Reliability. Horizonte Científico Educativo International Journal, 1(2), 1–23. https://doi.org/10.64747/xeasmv38