Our Assessment Innovation Fund pilots: The Open University
Shaping the future of learning and assessment
We’re on a mission to break boundaries in assessment with an investment fund to support and pilot new ideas on the future of assessment.
The Open University: Developing robust assessment in the light of Generative AI developments
The Open University was awarded just under £45,000 by the Assessment Innovation Fund to carry out research in order to develop evidence-based guidance on the strengths and weaknesses of assessment types in the light of Generative AI (GAI) tools.
About the pilot
Introduction
This study, funded by NCFE’s Assessment Innovation Fund and conducted by researchers from The Open University, addresses the challenges and opportunities posed by Generative Artificial Intelligence (GAI) tools, particularly large language models (LLMs) like ChatGPT, for assessment in further and higher education.
The research had three objectives:
- To identify which assessment types are most robust against GAI and which are most easily answered by it, enabling institutions to risk-assess their assessments and identify the highest-risk assessment(s) which may require changes.
- To enable some comparison and sharing of best practice across subject disciplines and levels.
- To assess the effectiveness of a short training programme to upskill educators in recognising scripts containing AI-generated material.
Methodology
A mixed-methods approach was employed to gather quantitative and qualitative data. The core of the research involved:
- Asking a group of 43 markers to grade 944 scripts, representing 59 examples of 17 different question types, and to flag any scripts they suspected were AI-generated, giving their reasons for that suspicion. This generated quantitative data on how well GAI performed, as well as on correct and false detection rates of GAI scripts. These results were analysed by question type, level and discipline. The reasons given for GAI detection were a key source of data for the qualitative research.
- The marking was divided into two batches (472 scripts in each batch), with markers undergoing training between the two batches. This generated quantitative data on the impact of training on correct and false GAI detection rates.
- Quizzes and a survey that formed part of the training provided additional quantitative and qualitative data, for example, on markers’ prior experience of GAI, improvement in GAI detection within the context of the training quizzes, and perceptions of the impact of training on their confidence. Open questions also provided further qualitative data.
- Finally, each marker participated in one of a series of focus groups, which provided further rich qualitative data on marking and detection practices and on the impact of training.
The research took place between March and July 2024.
Pilot Findings
Which assessment types are most robust against GAI, and which are most easily answered by it?
- AI tools like ChatGPT can produce high-quality responses, often comparable to student submissions.
- The AI answers for 58 of the 59 questions received passing marks.
- The only exception was an activity-plan question which required specific application of module material supported by unambiguous marking guidance.
Are there any differences across subject disciplines and levels?
- GAI answers scored particularly highly at level 3 (RQF and FHEQ level 3 in England, Wales and Northern Ireland; SCQF level 6 in Scotland), with marks decreasing progressively at levels 4 (SCQF level 7), 5 (SCQF level 8) and 6 (SCQF level 9).
- The discipline did not affect the performance of AI in assessments.
- However, certain disciplines (Law, Languages, Mathematics and Science) presented unique challenges in AI detection due to the nature of the content (for example, the possibility of machine translation in languages).
How effective is a short training programme to upskill educators in recognising scripts containing AI-generated material?
- Markers’ ability to detect GAI answers improved after receiving training (apart from at level 3). However, there was also an increase in false positives, where genuine student answers were mistakenly identified as AI-generated.
- The hallmarks found in GAI answers were also present in student answers, particularly, though not exclusively, those of weaker students.
- Detection was easier in subjects requiring critical thinking and application of knowledge, and more challenging in descriptive or straightforward tasks.
Conclusions
The 17 assessment types (assessed over 59 questions) used in this research were generally not robust in the face of GAI; either the GAI answers performed well and achieved a passing mark, or marker training increased the number of false positives. The most robust assessment types were audience-tailored tasks, observation by the learner and reflection on work practice, which align with what is often called ‘authentic assessment’[1], although GAI answers to these were still capable of achieving a passing grade.
Given the increase in false positives after training, focusing on the detection of GAI answers may not be feasible and would require considerable institutional capacity. Institutions should therefore focus on assessment design: risk-assessing their questions and re-writing those where GAI answers are most likely to perform well and/or the number of false positives is likely to be high. The hallmarks of GAI could inform more robust question and marking guidance design, ensuring that the question and associated marking guidance require students to complete tasks which are more difficult for GAI tools.
For answers which display the hallmarks of GAI, it is irrelevant whether this is because they were written by a weaker student or because GAI was used inappropriately. In both cases, students initially need study skills advice and support to improve their general skills and their AI literacy, prior to any academic conduct processes. This may require institutions to amend their academic conduct processes and provide GAI training for students.
The report makes a number of recommendations to implement this approach, including the enhancement of assessment design, GAI skills development for staff and students, and amendments to the academic conduct process.
This Executive Summary was initially drafted by ChatGPT-4 and subsequently heavily amended and adapted by members of the research team in July 2024.
[1] ‘Authentic assessment’ is a contested term. For an explanation of some of the issues, see for example https://www.qaa.ac.uk/docs/qaas/focus-on/nus-assessment-and-feedback-presentation.pdf?sfvrsn=1743f481_8.