Fourth International Language-For-All Conference, Gaziantep, Türkiye, 16-17 October 2025 (Unpublished)
As artificial intelligence becomes increasingly integrated into language education, the reliability and pedagogical value of AI-assisted writing assessment tools demand closer scrutiny. This study examines the grading performance of two language models, WritingGPT (a fine-tuned variant of GPT) and standard ChatGPT, against human raters in the context of university-level English preparatory programs. The participants were students enrolled in academic writing courses designed to develop foundational skills in English composition. The research is guided by two central questions: (1) How closely do WritingGPT and ChatGPT align with human evaluators in assessing student essays? (2) Can these models be considered reliable and valid tools for educational writing assessment? The student essays were scored by trained human raters, by WritingGPT, and by ChatGPT. The results indicate that WritingGPT aligned strongly with human grading trends and mirrored the raters' evaluative logic, although it consistently assigned more generous scores. ChatGPT, by contrast, showed a weaker and more inconsistent relationship with human assessments, suggesting lower reliability. Furthermore, statistical analysis confirmed that the differences between WritingGPT scores and human grades were systematic and statistically significant rather than the product of random variation. These findings suggest that while a fine-tuned model like
WritingGPT can support formative assessment in writing instruction, its outputs should be interpreted cautiously and always accompanied by human oversight. The study highlights the necessity of domain-specific tuning and ethical considerations when integrating AI technologies into language education contexts.
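For readers who wish to run a comparable rater-agreement check on their own data, a minimal sketch is given below. The abstract does not report which statistics were used, so the choice of Pearson correlation (for alignment with human raters) and a paired t-test (for systematic score differences), along with the file and column names, are illustrative assumptions rather than the study's actual procedure.

```python
# Illustrative sketch only: Pearson correlation and a paired t-test are
# assumed analyses, not those reported in the study; column names are hypothetical.
import pandas as pd
from scipy import stats

# Hypothetical file with one row per essay and one score column per rater.
scores = pd.read_csv("essay_scores.csv")  # columns: human, writinggpt, chatgpt

for model in ["writinggpt", "chatgpt"]:
    # Alignment with human raters: correlation between the two score vectors.
    r, r_p = stats.pearsonr(scores["human"], scores[model])

    # Systematic difference: paired t-test on the same essays.
    t, t_p = stats.ttest_rel(scores[model], scores["human"])
    mean_diff = (scores[model] - scores["human"]).mean()

    print(f"{model}: r={r:.2f} (p={r_p:.3f}), "
          f"mean diff={mean_diff:+.2f}, t={t:.2f} (p={t_p:.3f})")
```

A strong positive correlation paired with a significant positive mean difference would correspond to the pattern the abstract describes for WritingGPT: scores that track human judgments but run consistently higher.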