# A review of automatic item generation techniques leveraging large language models
[@tanReviewAutomaticItem2025], [[aig|AIG]], [[llm|LLM]].
## Abstract
This study reviews existing research on the use of large language models (LLMs) for automatic item generation (AIG). We performed a comprehensive literature search across seven research databases, selected studies based on predefined criteria, and summarized 60 relevant studies that employed LLMs in the AIG process. We identified the most commonly used LLMs in current AIG literature, their specific applications in the AIG process, and the characteristics of the generated items. We found that LLMs are flexible and effective in generating various types of items across different languages and subject domains. However, many studies have overlooked the quality of the generated items, indicating a lack of a solid educational foundation. Therefore, we share two suggestions to enhance the educational foundation for leveraging LLMs in AIG, advocating for interdisciplinary collaborations to exploit the utility and potential of LLMs.
## Notes
- A three-stage framework for AIG
- Pre-generation - developing a better understanding of the item to be generated
- Generation - increasing use of LLMs
- Post-generation - evaluating the measurement properties of the generated items
- Many studies applying LLMs in AIG lacked a solid educational foundation; the generated items were not evaluated in a pedagogical context.
- Future research should adopt the three-stage framework, which maps naturally onto a multi-agent setup (see the sketch after this list).
- Most studies did not adopt state-of-the-art commercial models (except some using OpenAI's GPT series).
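
The three-stage framework maps onto a pipeline quite directly. Below is a minimal, hypothetical sketch (my own, not from the paper) of how the stages could be chained with LLM calls; `call_llm`, `ItemSpec`, and the prompts are illustrative placeholders, not any particular API.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Placeholder for an actual LLM API call (hypothetical)."""
    raise NotImplementedError

@dataclass
class ItemSpec:
    domain: str        # subject domain, e.g. "algebra"
    construct: str     # what the item is meant to measure
    bloom_level: str   # target cognitive level, e.g. "applying"
    item_format: str   # e.g. "multiple-choice with 4 options"

def pre_generation(topic: str) -> ItemSpec:
    """Pre-generation: clarify the assessment context and measurement goal
    before any item text is written (ideally with SME input)."""
    return ItemSpec(domain=topic,
                    construct=f"understanding of {topic}",
                    bloom_level="applying",
                    item_format="multiple-choice with 4 options")

def generation(spec: ItemSpec) -> str:
    """Generation: prompt the LLM with the item specification."""
    prompt = (f"Write one {spec.item_format} item in {spec.domain} assessing "
              f"{spec.construct} at the '{spec.bloom_level}' level of Bloom's taxonomy. "
              "Include the key and plausible distractors.")
    return call_llm(prompt)

def post_generation(item: str, spec: ItemSpec) -> dict:
    """Post-generation: screen the draft; a real workflow would add expert
    review and field-testing for difficulty and discrimination."""
    critique = call_llm(f"Does this item measure {spec.construct} at the "
                        f"'{spec.bloom_level}' level? Answer yes/no and explain.\n\n{item}")
    return {"item": item, "llm_critique": critique, "needs_expert_review": True}
```

Each stage could be handled by a separate agent, which is why the framework suits a multi-agent design.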
## Annotations
> As noted in the methodology section, we did not find many studies reporting the measurement properties of the generated items.
> Notably, only 10 out of the 60 studies mentioned “validity”. This was followed by “reliability” and “pedagogical”, which found their places in eight studies. Other keywords were used even less frequently in the reviewed studies. The infrequent occurrences of these keywords signal a concerning issue: the majority of the reviewed studies seem to neglect measurement properties of items when generating items for educational purposes, which potentially impacts the validity and reliability of the assessment results. Moreover, this could result in generating questions that are too simple and do not require higher cognitive thinking to answer, failing to meet the measurement or pedagogical purposes (e.g., delivering feedback or evaluating achievement).
> we found that many studies applying LLMs in AIG often lacked a solid educational foundation. This might be because many of the authors were NLP researchers who possessed limited recognition and knowledge of learning or measurement theories. Alternatively, it could be because creating high-quality items that are readily usable for educational contexts was not their primary interest or research focus. Accordingly, many of those items are generated without deep consideration of their measurement purposes and item properties, which are essential to meet the requirements of educational assessment.
> For example, many of those generated items in the existing studies do not attempt to evaluate the higher-level cognitive processes specified in Bloom’s taxonomy, such as applying, analyzing, evaluating, or creating.
> Moreover, while some studies invited human participants to evaluate the quality of the items after being generated, only a few involved SMEs or measurement specialists in the AIG process.
> The absence of expert guidance has led many existing AIG studies using LLMs to overlook important measurement properties such as item difficulty and item discrimination. Thus, the current literature mostly provides evidence that LLMs can be leveraged to generate a large number of items, but little is known about whether these items possess the high quality necessary for educational purposes such as pedagogical teaching and assessment.
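
For context, the two item properties named here have standard classical-test-theory estimates: difficulty as the proportion of examinees answering an item correctly, and discrimination as the corrected item-total (point-biserial) correlation. A minimal sketch of how a field-tested 0/1 response matrix could be scored (my own illustration, not from the paper):

```python
import numpy as np

def item_stats(responses: np.ndarray):
    """Classical test theory item statistics.

    responses: (n_examinees, n_items) matrix of 0/1 scores.
    Returns per-item difficulty (proportion correct) and discrimination
    (corrected item-total point-biserial correlation).
    """
    n_items = responses.shape[1]
    difficulty = responses.mean(axis=0)          # proportion answering correctly
    total = responses.sum(axis=1)
    discrimination = np.empty(n_items)
    for j in range(n_items):
        rest = total - responses[:, j]           # total score excluding item j
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

# Toy example: 5 examinees, 3 items
scores = np.array([[1, 1, 0],
                   [1, 0, 0],
                   [1, 1, 1],
                   [0, 0, 0],
                   [1, 1, 1]])
p, r = item_stats(scores)
print("difficulty:", p)          # e.g. the first item is answered correctly by 80%
print("discrimination:", r)
```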
> we argue that a thorough item evaluation after generation is missing in the current literature.
> First, future research should prioritize clarifying the assessment context and measurement goals in AIG applications.
> Second, we recommend evaluating both the measurement properties and pedagogical soundness of generated items as an essential step in AIG.
> The three-stage AIG framework emerging from our analysis (i.e., pre-generation stage, item generation stage, post-generation stage) offers a structured approach for integrating these two recommendations into practice.